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GENERAL  CONVERGENCE  RESULTS  FOR  STOCHASTIC 


APPROXIMATIONS  via  WEAK  CONVERGENCE  THEORY 


ABSTRACT 


The  paper  treats  general  convergence  conditions  for  a 
class  of  algorithms  for  finding  the  minima  of  a function 
f (x)  when  f(x)  is  of  unknown  (or  partly  unknown)  form  - and 
when  only  noise  corrupted  observations  can  be  taken.  Such  pro- 
blems occur  frequently  in  adaptive  processes,  and  in  many  appli- 
cations to  statistics  and  estimation.  The  algorithms  are  of  the 
stochastic  approximation  type.  Several  forms  are  dealt  with  - 
for  estimation  in  either  discrete  or  continuous  time,  with  and 
without  side  constraints,  and  with  or  without  - periodic  search 
renewal.  The  algorithms  can  be  considered  as  sequential  Monte 

Carlo  methods  for  systems  optimization. 

partly 

The  innovations/ concern  the  method  of  proof.  However,  an 
interesting  'constrained'  and 'renewed'  algorithm  are  also  con- 
sidered. By  using  ideas  from  the  theory  of  weak  convergence  of 
probability  measures,  we  can  get  relatively  short  proofs,  under 
much  weaker  conditions  than  heretofore  required.  For  example, 
the  noise  can  be  correlated,  and  there  are  fewer  restrictions  on 
the  step  size.  Furthermore,  the  nature  of  the  method  permits 
generalizations  to  more  abstract  cases  (which  occur  for  example, 
if  we  are  optimizing  a distributed  parameter  system) . 

The  results  can  be  extended  in  many  directions  and  varia- 
tions of  the  technique  can  be  used  to  get  bounds  on  rates  of  con- 
vergence. Special  forms  of  the  method  can  be  applied  to  many  well 
known  ' adaptive ' procedures . 


1.  Introduction.  Using  results  in  the  theory  of  weak  convergence 


of  measures  (Billingsley  [1]),  and  in  stability  theory  for 
ordinary  differential  equations,  we  prove  some  general  convergence 
theorems  for  the  sequences  of  random  variables  which  are  generated 
by  algorithms  of  the  stochastic  approximation  type  (see  e.g., 

Kiefer  and  Wolfowitz  [6],  Blum  [2],  Schmetterer  [15]),  Fabian  [5], 
Wasan  [16] . Such  algorithms  are  used  when  one  wishes  to  locate, 
via  a recursive  Monte-Carlo  method,  a minimum  of  a function,  under 
the  handicap  of  "noisy"  data.  Algorithms  for  both  constrained  and 
unconstrained  optimization  problems  will  be  considered,  and  for 
rather  general  noise  processes. 

The  noise  processes  that  we  allow  seem  to  be  more  typical 
(than  those  usually  considered  in  the  stochastic  approximation 
literature)  of  the  type  usually  found  in  applications  to  sequential 
parametric  optimization  of  control  systems.  Some  work  concerning 
noise  processes  where  the  usual  orthogonality  condition  is  not 
satisfied  was  reported  in  [3],  but  under  very  special  assumptions, 
and  for  a one  dimensional  Robbins-Munro  process  only. 

Remarks  on  a weak  convergence.  The  space  Dm  (-<*>,'»)  of  Rm 
valued  functions  on  (-00,00)  which  are  right  continuous  and  have 
left  hand  limits,  is  most  convenient  for  our  purposes.  The  space 
D^[0,“)  was  discussed  by  Lindvall  [11]  in  detail.  Let  g^(*) 
denote  the  function  on  [0,°°)  defined  bys  g^(t)*l,  t<j, 
gj  (t)  = j + 1 - t on  [j,j+l],  and  g^U)  * 0,  t > j ♦ 1. 

For  any  function  x(*)  in  D [0,°°),  define  xJ(*)  by 

x3  (t)  = x(t)g_j(t).  Lindvall  proceeds  roughly  as  follows.  Let 


2. 


dQ  j(x(*)»y(*))  denote  the  Skorokhod  dQ  metric  on  D[0,j+1] 
(See  Billingsley  p.  Ill  on,  where  j = 1) . The  metric 


d (x ( • ) ,y(*) ) = l 
j = l 


is  defined  on 


D1  [0 


°°)  • 


d.  . (x3 (• ) ,y3 (* ) ) 

- 2 3 

1 + dQ,  j (x  ( • ) 3 , y ( * ) 3 ) 


Under  this  metric,  the  space  is  complete 


and  separable.  Tightness  (see  Billingsley  [1]  for  the  definition 

of  tightness)  of  a sequence  {Xn(*))  of  random  functions 

with  paths  in  D^[0,°°)  is  equivalent  to  tightness  of  (X3(*)} 

on  [0 , j+1]  , for  each  j > 1.  To  get  D^(-®,“)  , as  indicated 

by  Lindvall,  we  simply  symmetrize  the  definition  of  g ^ ( • ) 

Simply  define  gj(*)  on  (-00,00)  by  g^  (t)  = g ^ ( 1 1 1 ) and 

replace  g^  by  g^  in  the  definition  of  x3,  and  we  use  the 

above  metric  d (*,•)»  where  now  dn  .(*,•)  denotes  the  Skorokhod 

u r J 

dp  metric  on  D"*"  [- j-1 , j+1]  . We  use  Dm(-°°,“)  to  denote  the 
m fold  product  of  D^"  ( — 00 , 00 ) . 

While  our  processes  will  be  Dm(-°°,°°)  valued,  the 
limits  will  generally  (except  for  one  case)  have  continuous  paths 
w.p.l.  It  follows  from  Billingsley  [1]  , Theorem  15.5,  that,  if 

{ ( • ) } is  a sequence  of  processes  with  paths  in  Dm[-T,T]  for 

some  T > 0,  and  if  (1)  and  (2)  hold,  then  is  tight  on 

Dm[-T,T]  and  any  separable  process  which  is  a limit  in  distribution 
of  a subsequence  of  {Xn(*)}  has  continuous  paths  w.p.l.  Hence, 

if  (1)  and  (2)  hold  for  each  T,  where  6 and  n^  can  depend 

on  T,  then  the  assertion  of  the  last  sentence  holds  for 

_m . . 

D (-■»,<»)  . 


For  each  r\  > 0,  e > 0,  there  is  an  < °°,  a fie  (0,1) 
and  an  Integer  nQ  < °°  so  that 

(1)  P{|Xn(0) | > Nn>  < n,  n > 1 

(2)  P { sup  |Xn(t)  - Xn(s)  | > e}  < n,  n - Ho- 

lt-si <6 

T>t>S>-T 

We  note  that  if  {Xn(*)>  is  tight  on  Dm (-«,«)  , and  if  Xn(t)  -*■  0 

in  probability  as  n -*■  °°,  for  each  t e (-<*>,<»)  , then  Xn(*) 
converges  to  the  zero  element  of  Dm (-«,»). 

We  will  also  need  the  following  result  from  the  stability 
theory  of  ordinary  differential  equations  (LaSalle  and  Lefschetz 
I10J),  Let  g(*)  denote  a continuous  Rr  valued  function 
on  R and  let  X(*)  denote  a bounded  function  on  either  the  interval 
(-oo,oo)  or  on  [0,°°)  which  satisfies 

X(t)  = g (X (t) ) . 

Then  all  limit  points  (as  t •+■  °°)  of  X(*)  are  contained  in  the  largest 
finite  invariant  set  of  the  differential  equation  y = g(y), 
where  an  invariant  set  M is  defined  as  follows.  Let  y e M, 
then  there  is  a continuous  Rr  valued  path  {y(t),  «>  > t > -«>}, 
which  satisfies  y(0)  = y,  y(t)  = g(y(t)), 

and  y(t)  e M,  all  t e (-00,00)  . The  use  of  the  doubly  infinite 
interval  is  especially  valuable  in  applications  - it  facilitates, 
as  will  be  seen,  the  treatment  of  stationarity  or  of  asymptotic 
problems.  Also,  if  X(t)  tends  to  a set  N as  t then 


4. 


it  tends  to  the  largest  invariant  set  contained  in  N,  as 
t 00  (which  may  be  much  smaller  than  N itself  ) . This  result 
is  very  useful  in  the  stability  analysis  of  differential  equations. 
See  LaSalle  and  Lefschetz  (10] . 

Section  2 will  treat  an  unconstrained  problem,  and  in 
Section  3,  we  discuss  several  extensions.  The  results  are 
generalizations  of  those  obtained  by  Ljung  [12] . The  approach  taken 
here  is  somewhat  different.  Our  method  uses  rather  different 
techniques  and  seems  to  be  simpler.  Also,  owing  to  the  features  of 
the  method,  it  seems  that  results  can  be  obtained  for  a greater 
variety  of  problems,  and  also  for  stochastic  approximations  with 
abstract  valued  random  variables. 


2 . The  Unconstrained  Problem  Classical 

Stochastic  Approximation.  Let  us  as sume 


(Al)  {a^}  is  a (possibly  random)  positive  null  sequence 

satisfying  T a = ». 

n n 

(A2 ) f(*)  is  a continuous  real  valued  function  on  R , 

bounded  from  below  and  continuously  differentiable  (whose 
gradient  is  denoted  by  f (•)).  Each  solution  to  the  differ- 
ential equation  x = -fx(x)  on  (0,°°)  is  bounded. 

(A3)  Let  S denote  the  set  of  points  {x:  fx(x)  = 0}. 
Assume  that  S is  bounded v and  connected. 


(A4)  {6n> 

variables . 

Define  t 

n 

max { n : t < t } . 
n — 


is  a null  sequence  of  Rr  valued  random 

n-1 

by  tn  = 0,  t = £ a.  and  write  m for 

^ **  -l  — A 1 ^ 


5. 


(A5)  (£n)  is  a sequence  of  Rr  valued  random  variables 


which  satisfy  (3) . 


lim  sup 
N 


max 

0<t<T 


i=m(tN+t)-l 

l ai^i  > = 0 

i=m(tN)  I J 


(3) 


' 

m(tN)-l 

’ 

lim  sup  P< 

max 

I a . C • 

i=m(t  +t)  1 1 

> ef> 

N 

-T<t<0 

, each  T > 0, 

e > 0. 


> = 0 


Condition  (A5)  is  rather  weak,  and  it  will  be  discussed 
further  below.  The  iteration  for  many  stochastic  approximation 
algorithms  for  the  location  of  a minimum  of  a regression  function 
f(-)  can  be  represented  in  the  following  form 


(4) 


n+1 


= Xn  - a„f„(XJ  + a„g„  + a„£„. 


n x n 


n n 


n’n 


Remark  on  the  Kief er-Wolfowitz  procedure.  Let  us  consider  one 
particular  case,  that  of  the  Kief er-Wolfowitz  procedure.  Suppose 
that  for  each  x e Rr,  H(*|x)  is  a distribution  function  of  a real 
valued  random  variable  and  that  / y H(dy|x)  = f (x) , and  suppose  that 
Xn  is  a given  random  variable.  Let  c denote  a positive  null 


n 


. th 


sequence  and  e^  the  unit  vector  in  the  i coordinate  direction.  Let 
Y y j = 1,...,  denote  a sequence  of  random  variables  obtained  as 

follows.  Y2i  and  Y2i-1'  i = 1»***'r»  are  random  draws  with  law 
{regular  conditional  probability,  given  XQ)  . H2i  ( • | Xg+e.^Cg)  and 
H2i-l(*  iKo-e^)  , resp.  Define  the  itfl  component  of  X^  by 


I 

1 


-- 


6. 


X1  = X0  ~ an 


[Y2i-Y2i-l] 

2c_ 


In  general,  suppose  that 


,X  are  available.  Then  let  Y2n+2i 


and  Y2n+2i  1'  resP* • be  random  variables  with  laws  (regular 
conditional  distributions)  H2n+2i ^ * I Xn+eicn^  and 

H„  _ . . ( • I X -e . c ) , resp.,  and  define  the  i*"*1  component  of  X . by 
2n+2i-l  1 n in 


„i  „i  _ lY2n+2i_Y2n+2i-l] 

Xn+1  = Xn  ' an  2 5^  * 


If  f(*)  has  continuous  second  derivatives,  then  we  can  write 


Y2n+2i  ” Y2n+2i-l 
2c 


" fx.(V  + cn^n, i + ?n' 


where  we  define  3'  . by  the  equality 

n , i 


f (X  ) + c 6 • 

x.  n n n,i 


f<Vei°n>  • £(Vei°n) 


2c 


and  f (•)  is  the  derivative  with  respect  to  x..  If  the 

xi  1 

second  derivatives  of  f(*)  are  bounded,  then  so  are  the  3'  . 

n,i 

The  term  is  the  "observation"  noise,  and  (Xn)  satisfies 

(4)  where  3 is  the  vector  with  components  c 3'  . Of 

n n n , l 

course,  the  recursion  (4)  can  be  obtained  in  many  ways.  We  do 
not,  contrary  to  the  usual  practice,  assume  that  the  {£.}  are 
orthogonal. 

In  many  examples,  the  distribution  Hn(*|x)  is  replaced 


by  a distribution  H^(*|x)  of  a Rr  valued  random  variable,  where 
/ yH^(dy|x)  = Then  the  Kief er-Wolf owitz  form  is  still  (4), 

but  with  gn  = 0.  With  either  regression,  the  definition  of 
implies  that  E[£n|  Xn]  = 0 w.p.l.  However,  in  many  practical  cases, 
one  uses  a recursion  like  (4) , but  where  the  {£n>  constitute  a 
sequence  of  actual  observation  errors,  which  can  not  be  repre- 
sented in  terms  of  the  laws  such  as  H(*jx)  or  H'(*|x). 

Various  examples  appear  in  Ljung  [12,  13],  and  the  references 
contained  therein.  Here,  we  work  with  (4) , and  no  characteriza- 
tion of  {£  } is  involved. 

Discussion  of  condition (A5) . Suppose  that  there  is  a real 
sequence  {a^}  such  that  the  } satisfy  the  usually  imposed 

condition  in  stochastic  approximation 


EUn+1|  ?i'  1 - n'  ai  1 - n + 1;  V = ° W'P*1* 

Etl?n+l|2  l*i»  i < n;  aif  i < n + 1;  XQ]  < a*+1,  w.p.l. 


Then  to  verify  (3),  we  use  the  martingale  property  of  the  partial 
sums  of  the  and  <?et 


E max 
0<t<T 


m(tN+t) -1 

l 

i=m(tN) 


a.  5. 


m(tN+T)-l 


< 4E 


i=m(tN) 


2 2 
ai°i 


m(tN-t)-l 

l a.q 

i=m(tN-T) 


"Y'1  2 2 

: 4E  l a. of. 

i=m(tN-T) 


E max 
0<t<T 


8. 


Then  (3)  holds  if 


1 

J 


sup  a o + 0,  as  i -»■ 
. . n n 
n>i 


Condition  (6)  holds  if  {o^}  is  uniformly  bounded  (e.g.,  if 

2 

H ' ( • | x)  is  used).  If  H(*|x)  is  used,  then  on  is  inversely 

2 o 

proportional  to  cn»  Suppose  that  there  are  real  A,C,y,ct ,o 


such 


that  a < An'a  = a2c  2 < a2C2n-2Y  and  a > 2y, 
n - ' n n - ' ' 


then 


(6)  also  holds. 


Now,  let  us  drop  (5) . Suppose  that  the  {a^}  are  real 

numbers.  (If  {a^}  are  random  variables,  then  we  can  still 

continue  the  argument  by  use  of  suitable  upper  bounding  of  the 

{a^}.)  For  notational  convenience  suppose  that  the  are 

scalar  valued.  (Otherwise  we  treat  each  component  separately.) 

Define  = r^^.  Suppose  that  there  are 

functions  R. . such  that  R. . = R..  and 
ID  ID  Di 


rijk)J  - Ri  jRk£  + RikRjJl  + RiX,Rkj* 


Condition  (3)  holds  if  there  is  a function  R(*)  such  that 

Rij  - RUi“D|)  and  R(n)  -*  0 as  n °°.  We  omit  the  argument,  for 

it  can  be  seen  from  the  following  more  general  case,  where  the 

2 

covariance  matrix  of  is  inversely  proportional  to  c^. 

Assume  that 


Rij  < R(|i-j |)/c.ci. 


(7) 


■FTT 


( if 


9. 


It  is  implied  by  Theorem  15.3,  and  the  proof  of  Theorem  12.3  of 
Billingsley  [1]  that  (3)  will  hold  if  there  is  a real ^sequence 


{Kn>  such  that 


(8) 


m(tN+t)-l 


l a. 

L l i 


m(tN-s) 


* KNlt+s  + am(tN-s)-ll  ' 311 


fcN  > s > 0, 


where  K ■*  0 as  N -*  00 . The  left  side  of  (8)  is  bounded 
N 


above  by 


1 a.a.a.  a.(R..R,  „ + R..R..  + R.  .R,  .). 

i,  j ,k,  9 i 3 k SLy  ij  k£  lk  i£  Tc] 


(The  summation  for  each  index  is  from  m(tN~s)  to  m (tN+t)-l. 
The  limits  will  be  omitted  for  notational  simplicity.) 


Consider  one  term  only,  and  note  that 


(9) 


l a.a.a.  a R.  .R  < Y a. a.  — ^-1  ^ 3 1 ^ 
■:  j i.  . i ] k £ i]  kJ  - .%  i i c.c. 

1 /3  /*/*-  J J j_i  1 3 


2 i a. a.  RiliiiLT2 
. . in  c.c. 

1 3 _ 

0**3 


Suppose  that 


lim  sup  a^/c .c . = 0. 
i+oo  j>i  J J 


and  that  R(n)  is  summable.  Then 


(10)  l a . R ( | i-j  | )/c.c  . -*•  0 

j>i  J J 


T-.  Tit 


- 


r 


i <. 

u 


1 


10, 


as  i -*■  00  . This  implies  that  there  is  a of  the  required 

type  such  that 


m(tN+t) -1 


m(tN+t)-l 


I a±a.R(  | i— j | )/c.c.  < (^)1/2  l a. 

j,i=m(tN-s)  J J i=m(tN-s) 


which  implies  (8) . 


Define  processes  X°(t),  F°(t),  B°(t),  E°(t)  on  ( — ,00)  by 
X (t)  = Xq  , B (t)  = E (t)  = F (t)  = 0 for  t < 0,  and  for  t > 0, 


m ( t) -1 

X°(t)  = Xn  on  ttn/tn+i)  ' F (fc>  = aifx(Xi} 

m(t)-l  m(t)-l 

B°(t)  = l a.?.,  E°(t)  = I aiq. 

• _ r\  -L  -L  i — r\  -1- 


i=0 


i=0 


Then 


(ID 


X°(t)  = X°(0)  - F°(t)  + B°(t)  + E°(t),  t e (-°°,°°)  . 


Theorem  1 . Assume  (Al)  to  (A5) , and  that 


(12) 


sup  | Xn  | < 00  w . p . 1 . 
n 


Then  {X^}  tend  to  S in  probability,  as  n -*•  More  strongly, 
for  each  T < 00  and  e > 0, 


4* 


I 


(13)  P{  sup  dist.  [X°  (tN+s)  ,S]  > e)  -*■  0 as  N 
-T<s<T 


t t- 


J 


11. 


I 

• i 

(dist(x,S)  denotes  the  Euclidean  distance  between  x and  the 

set  S)  . *1 

1 


Remarks  on  the  theorem.  The  condition  (12)  can  often  be  verified 
by  independent  means.  In  practical  problems  some  restriction 
would  be  put  on  the  maximum  values  anyway.  We  get  only  convergence 
in  distribution-or  the  somewhat  stronger  result  (13) , while  the 
classical  results  concern  convergence  w.p.l.  In  fact,  under 
our  conditions,  convergence  w.p.l  is  not  always  possible.  But  the 
conditions  here  are  relatively  weak. 

Proof . By  (11) , 

(14)  X°(tN+s)  = X°(tN)  - [F°(tN+s)  - F°(tN)] 

+ [B°(tN+s)  - B°(tN)]  + E°(tN+s)  - E°(tN)]. 

The  idea  of  the  proof  exploits  the  fact  that  the  convergence  of  {an} 
to  zero  implies  an  increasing  "compression  of  discrete  time  to 
continuous  time"  in  the  functions  in  (14)  , as  N We  expect  that 

this  "compression"  and  (A4)  and  (A5)  will  "eliminate",  asymptotically, 
the  effects  of  B°(*)  and  E°(*)  relative  to  the  effects  of  D° ( • ) . 
We  define  a sequence  of  left  shifts  so  that  the  "asymptotic"  part 
of  the  functions  remain  accessible  for  an  analysis  using  weak  con- 
vergence and  differential  equations  methods.  For  each  integer 
N > 1 define  the  processes  XN  ( • ) , FN  ( • ) , EN  ( • ) , BN  ( • ) on  (-00,00) 


| 

I 
• I 

- » 

:i 

j 

i 

:i 

i 

1 

.1 

i 

i 

a 


by 


‘ i 


a 


XN(s)  = X°(tN+s),  XN(0)  = x°(tN) 

FN(s)  = F°(tN+s)  - F°(tN) 

EN(s)  = E°(tN+s)  - E°(tN) 

BN(S)  = B°(tN+s)  - B°(tN). 


Then,  (14)  can  be  written  as 


XN(s)  = XN(0)  - FN(s)  + BN(s)  + EN(s). 


The  fact  that  Ew(s)  ->  0 in  probability  as  N + 0, 
for  each  s,  follows  from  (A5)  . Similarly,  Bn(s)  0 in 
probability  as  N -*  °°  for  each  s,  by  (A4)  and  the  representation 


m(tN)-1 


m(tN+s)-l 


Bn(s)  = - l a . g . , s < 0,  BN(s)  = l a.g.,  s > 0, 

\ 1 J-  — / X-  \ -LX 


m(tN+s) 


m(tN) 


and  the  fact  that  the  sums  of  the  ai's  over  the  above  ranges  is 

- [s+am(tN+s)-l]  (s  < 0)  and  - [s+am(tN)-l]  = [s+aN-l]  for 
s > 0. 

The  sequence  N = 1,2,...}  = {Xn(*)fFN(*),B  (•), 

EN(*),  N = 1,2,...}  has  paths  in  D4r(-°°,°°),  and  we  will  now 
prove  tightness  and  that  the  process  which  is  the  "limit”  of  any 
subsequence  which  converges  in  distribution  has  continuous  paths 
w.p.l.  We  need  only  verify  (1)  and  (2). 
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Criterion  (1),  (2)  hold  for  {F  ”(•)}  by  (12),  the  continuity 
of  fx(*),  and  the  fact  that  FN(0)  = 0.  The  criterion  holds  for 
{BN(*)},  by  (A4)  and  the  fact  that  BN(0)  = 0.  Condition  (A5) 
and  the  fact  that  EN(0)  = 0,  imply  (1),  (2)  for  { EN ( - ) } . 

Finally  (XN(o)}  = (X^  satisfies  (1),  by  (12).  This  last  fact, 
together  with  the  fact  that  the  (FN(*)),  (BN(*)}  and  { EN ( - ) } 
satisfy  (2)  for  each  T < 00 , implies  that  (X  (•))  satisfies  (2) 
for  each  T < °°.  Thus  ($N(*)}  is  tight  on  D^r(-°°,°°),  and  any 
process  which  is  the  limit  in  distribution  of  a convergent 

subsequence  has  continuous  paths  w.p.l. 

Let  N index  a subsequence  of  {4>N(*)}  which  converges 
in  distribution,  and  let  $(•)  = (X( • ) ,F ( • ) ,B ( • ) ,E ( • ) ) denote  the 

"limit".  Then  B(t)  = E(t)  = 0.  Since 

(18)  XN(t)  = XN(0)  - f(XN(s))ds  + functions  tending  to 

0 

the  zero  function  in  distribution. 


we  must  have 


(19)  X(t)  = X(0)  - F (t)  = X(0) 


rt 

- Jof* 


(X (s)  )ds,  t e (—oo , oo) 


Since 


lim  sup  P{  sup  |XN(t)|  > M} 
N oo>t>~  °° 


lim  P{  sup  |X  | > M} 
M -*■  00  °°>n>0  n 


the  weak  convergence  implies  that 
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P{  sup  | X ( t ) | < <*>}  = 1. 

oo>t>-oo 

The  only  solutions  to  (19)  which  are  uniformly  bounded  on  (-00,00) 
must  be  in  the  largest  finite  invariant  set  of  x = -f^fx).  But, 
since  f ( • ) is  the  gradient  of  a function  that  is  bounded  from 
below,  this  invariant  set  is  just  S.  Thus  X(0)  e S.  Hence 
XN(0)  = XN  -*■  S in  distribution. 

Let  T denote  a positive  real  number.  Since  the  function 
q(*)  on  D (-00,00)  which  is  defined  by 


J 

I 

I 


q(y(‘) ) = min[l,  sup  distance (y (s) ,S) ] 

-T<s<T 

is  continuous  on  D ( -»,<*>)  w.p.l,  relative  to  the  measure  in- 

duced  on  D (-00,00)  by  the  continuous  process  X(*),  the  weak 

N 

convergence  implies  that  the  distribution  of  q (X  (•))  must 
converge  to  that  of  q(X(*)).  Clearly,  the  subsequence  is 
irrelevant,  since  S does  not  depend  on  the  subsequence  and 
q (X ( * ) ) = 0 for  any  limit  X(*).  Thus  (13)  holds.  Q.E.D. 


3.  Extensions . Several  interesting  extensions  will  be  indicated. 
They  are  suggestive  of  additional  applications.  The  details  are 
similar  to  those  concerned  with  Theorem  1,  and  only  sketches  will 
be  given 

(a)  Continuous  parameter  stochastic  approximations.  Let 

a(*)  denote  a real  valued  positive  measurable  function  on  [0,°°), 

• 00 

and  suppose  that  a(t)  -*•  0 as  t -*•  °°  and  a(s)ds  = °°.  Assume 
(A2)  and  (A3),  and  let  8(*)  denote  a Rr  valued  measurable 


process  on  [0,°°)  which  tends  to  zero  as  t -►  Let  £(•)  denote 
a measurable  Rr  valued  process  on  [0,°°).  Define  the  stochastic 
approximation  (20) 


(20)  X ( t)  = X (0)  - f a (s)  f (X (s) ) ds  + I a(s)$(s)ds  + I a(s)g(8)ds. 

Jo  J 0 Jo 


Such  continuous  time  procedures  have  been  discussed  by  Sakrison 
[14]  and  others. 

Define  the  time  transformation  t ( - ) by 

-T 

t = a (s) ds , 

J0 


and  define  the  function  Y°(*)  by  Y°(t)  = X(x(t)).  Then 


Y°(t)  = Y° (0)  - ftfv(Y°(s))ds  + 
J0  x 


a (s)  3 (s)  ds 


rx(t) 

+ a(s)C(s)ds. 


For  each  real  T > 0 define  the  functions  YT ( • ) ,FT ( • ) ,E^ ( • ) and 


B1  ( • ) on  (-“,«>)  by 


YT(t)  = y° (T+t) , 


x rT+t  ,p 

F1  (t)  = f (Y°(s)ds,B1(t) 

J T x 


■n 


T (T+t) 


a (s)  3 (s)  ds 


ET(t) 


fT (T+t) 

= a (s)  £(s)dsf  oo  > t > -T. 

Jx(T) 


At  t < -T,  let  the  functions  take  their  values  at  -T.  Then 
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(21)  YT(t)  = YT(0)  - FT(t)  + BT(t)  + ET(t), 


which  is  just  the  continuous  parameter  version  of  (16) . Assume 
that 


(22)  sup|x(t)  | < °°  w.p.l. 

t>0 

The  sequence  {YT  ( • ) ,FT  ( • ) ,BT  ( * ) ,ET  ( • ) , T < °°}  has  its  paths  in 

C^r  (-<»,<»)  , the  space  of  Rr  valued  continuous  functions  on  (-°°,°°)  . 

4 IT 

We  can,  of  course,  assume  that  they  are  in  D (-“,“)  and  verify 

criterion  ( l)-(  2)  on  each  interval  [-T,T]  . 

T 

The  sequence  {B  (•)}  is  tight  and  asymptotically  degenerate. 

Also  {FT ( - ) } is  tight,  by  (22)  and  the  continuity  of  f ( • ) . 

T 

If  {E  (•)}  were  tight  and  asymptotically  degenerate,  then 
arguments  parallel  to  those  used  in  the  proof  of  Theorem  1 would 
yield  that  if  Y(*)  is  any  limit  in  distribution  of  (YN(*)},  then 


Y ( • ) must  satisfy 


Y(t)  = -fx(Y(t))  , 


and  Y (t)  e S,  all  t e (-00,00)  , and  also  that 


X (t)  -*■  S , t -*■  °°,  and 


P{  sup  dist-(X(T+s)  ,S)  > e } -*■  0,  T-*-°°,  for  any  t > 0. 
-t<s<t 
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T 

Tightness  of  {E  (•))  is  implied  by  (1)  and  (2)  (for 
T 

E (*)  replacing  Xn(*)  which,  in  turn,  is  implied  by  a 
continuous  parameter  analog  of  (3) , where  the  sum  is  replaced  by 
an  integral,  tN  by  T and  m(u)  by  x (u)  , for  each  u e . 

For  example,  suppose  (for  notational  convenience  only)  that  £(•) 
is  scalar  valued  and  that 

|E5 (vx) c (v2) c <v3) c (v4) I < r (v1,v2,v3,v4) , 


where  r(*)  satisfies 


r (v1,v2,v3,v4)  < R(v1,v2)R(v3,v4)  + R(v1,v3)R(v2,v4) 

+ R(v1,v4)R(v2,v3)  , 

and  that  there  are  positive  measurable  functions  c(*)  (a  finite 
difference  interval)  and  R(*)  such  that 

R(Vi,v2)  < R( lv1"v2 | )/[c(v1)c (v2) ] 


Suppose  that  there  is  a sequence  KT  -*•  0 as  T 
t,s  with  t > s. 


(23) 


fT|T+s)I  c~(v ) c (w)  R<|v-w|)<avdw  < K^lt-i 


Then  we  can  show  that  there  is  a real  number  K 


E|ET(t)  - ET ( s ) | 4 < K.K  | t-S | 2 , 


■*  00  such  that  for  each 


such  that 


which  implies  both  tightness  and  asymptotic  degeneracy  of 
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(EN(»)}.  Equation  (23)  does  not  seem  to  be  particularly 
stringent  in  applications. 

(b)  A nondegenerate  stochastic  approximation.  Another 
interesting  application  of  the  ideas  of  the  last  section  occurs 
if  we  wish  to  design  a procedure  to  find  an  actual  absolute 
minimum  of  f(.),  rather  than  simply  a stationary  point.  In 
order  to  prevent  the  stochastic  approximation  procedure  from 
degenerating  to  a stationary  point,  and  to  continue  the  search 

until  "all"  local  minima  are  found,  we  may  force  the  process  to 

jump  to  a new  "starting  position*' every  once  in  a while. 

This  is  not  necessarily  the  best  procedure,  but  is  easy  to  implement 
and  often  discussed.  Suppose  that  g(*)  is  a bounded  continuous 
function  on  Rr  which  is  zero  for  large  x,  let  b > 0 be  a real  numbei 
and  let  Q(*)  denote  a distribution  function  on  Rr,  with  support  in 
a bounded  set.  Let  jn  denote  a sequence  of  Rr  valued  random 
variables,  such  that  P{jn  = 0|  i < n;  j^,  i < n}  = 1 - anb  + 0(an> 

and  P{jn  e A|  jn  f 0;  i < n;  ai,  i < n = Q (A)  . Define 

m (t) -1 

J (t)  = l j.,  J (t)  = J(t+tN)  - J(tN),  and  the  iteration 
i=0  1 

I 


(24) 


n+1 


= X - 
n 


anfx(V 


+ a B + a r + g (X  ) j . 
nn  nsn  yvn/Jn 


Let  J(*)  denote  a Poisson  jump  process  with  infinitesimal 
jump  probability  bdt,  and  jump  distribution  Q(*),  and  defined 
on  (-<*>,«).  Under  (Al)  - (A5)  and  (12),  a generalization  of  the 

method  of  proof  of  Theorem  1 yields  that  XN(*)  tends  in  distribution 
to  a process  X(*)  which  solves 


X(t)  = X(0) 


- f f (X(t))ds  + f g(X(s"))dJ(s),  te  (-»,<»), 
J0  J0 


and  which  is  uniformly  bounded  (pathwise)  on  (-00,00)  . In  (24)  , or 
(25),  each  time  there  is  a jump,  the  search  is  "renewed",  in  a 


sense . 


(c)  Constrained  stochastic  approximations.  Classical 


stochastic  approximation  basically  consists  of  sequential  Monte- 
Carlo  procedures  for  finding  zeroes  or  minima  of  regression 
functions.  Many  problems  in  applications  involve  constraints, 
although  much  less  work  has  been  done  on  the  constrained  problem 
(see  Fabian  [4],  Kushner  [7],  Kushner  and  Sanvicente"  . -i.  - 

[8] ) . We  will  discuss  a version  of  Theorem  1 for  a simple- 

problem  with  equality  constraints. 

Let  4>  ( * ) denote  a continuously  differentiable 
valued  function  on  Rr,  q < r and  let  $>(x)  denote  the  Jacobian 

matrix  of  $ (x) . We  wish  to  find  a parameter  x which  minimizes 

f(*)  over  the  feasible  set  B = (x:  <f>(x)  = 0}.  The  function  f(*) 

is  not  known,  but,  at  any  parameter  setting  Xn,  we  can  observe  (as 

for  the  classical  stochastic  approximation)  a random  variable  which 

we  write  as  f (X  ) + a 8 + a £ , where  B and  £ represent 

x n n n n n n n 

an  observation  "bias"  and  observation  "noise",  resp. 

Assume  that  $'(x)<Kx)  - 0 implies  that  $(x)  is  of 
full  rank  (hence  that  <p  (x)  = 0 also)  . Some  such  assumption  is 
usually  required  of  numerical  algorithms,  to  assure  that  the 
iterates  will  not  "get  stuck"  outside  of  B,  even  in  the  purely 
deterministic  case. 


£7* 
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In  the  unconstrained  case,  the  algorithm  (4)  seeks  a 
stationary  point  of  f(*);  i.e.,  a point  x where  the  necessary 
condition  for  minimality  fx(x)  = 0 holds.  In  the  constrained 
case,  we  also  seek  a feasible  point  where  a necessary  condition 
for  optimality  holds  (the  usual  one  of  the  calculus) . In 
particular,  we  seek  points  x e B such  that  there  is  a vector 
X of  Lagrange  multipliers,  with  components  X X such 
that 

(26)  fx(x)  + $ ' (x) X = 0. 


Let  S denote  the  set  of  such  feasible  points  and  suppose  that 
S H B is  bounded v and  connected. 

Let  k denote  a fixed  positive  number.  The  algorithm 
(27)  was  discussed  by  Kushner  and  Kelmanson  [9]  , who  proved 

w.p.l  convergence  to  the  set  S Pi  B under  the  conditions  (among 
others)  that  satisfied  (5)  and  that  the  {a^}  were  square 

summable . 


(27) 


n+1 


= X - 
n 


WW*  + 


IT  ^ + 

nsn 


Vn  + M’<V*(xn>]  ’ 


where  we  use  the  definitions 

tt(x)  = [I  - (x)  ($  (x)  (x)  )_1$  (x)  ] 


and 
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The  inverse  notation  implies  that  the  pseudo  inverse  is  used 
if  the  inverse  doesn't  exist.  The  matrix  ir(x)  is  just  a 
projection  onto  the  orthogonal  complement  of  the  rows  of  4>(x). 

If  x is  feasible  and  ir(x)fx(x)  = 0,  then  x e S O B. 

Assume  (Al)  - (A5)  and  (12).  Define  the  functions  XN(*)» 
N > 0,  for  the  sequence  generated  by  (27) , as  they  were  defined 

M 

in  Section  1 for  the  sequence  generated  by  (4).  Then  {X  (•)} 
is  tight  on  Dr(-c°,a>),  and  if  N indexes  a subsequence  which 
converges  in  distribution,  then  the  limit  X(-)  satisfies 

(28)  X ( t)  = -Tr(x)fx(x)  - k<I> ' (x)  <)>  (x)  on  (-00,00)  . 


Next,  we  check  that  X(t)  e B.  Consider  the  Liapunov 
function  P(x)  = <t> ' (x)  4>  (x)  . Computing  the  derivative  P(X(t)) 

along  trajectories,  we  get  (at  X(t)  = x) 

P(x)  = — 24> ' (x)  4>  (x)  [it  (x)  fx  (x)  + k$ ' (x)  <j>  (x)  ] . 

Since  tt(x)  projects  onto  the  subspace  orthogonal  to  that 
spanned  by  the  rows  of  <J’(x),  $ (x)  tt  (x)  • v = 0 for  any  vector 


v. 


Hence 
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(29)  P(x)  = -2  | <J> ' (x)<J>(x)  |2. 

Since  (x)<j> (x)  = 0 =>  cf> (x)  = 0,  (29)  and  the  fact  that  P(x)  > 0,  imply 
that  the  only  invariant  sets  M for  the  system  (29)  must  satisfy 
the  property:  if  x e M,  then  <J>  (x)  = 0.  Next,  using  the 

fact  that  tt(x)  is  a projection,  and  fx(*)  is  a gradient  of 
a function  that  is  bounded  from  below,  we  get  that  the  invariant 
sets  must  be  included  in  {x:  ir(x)fx(x)}  = 0.  Thus  the  in- 
variant sets  are  in  B D S.  Following  the  reasoning  in  the 
conclusion  of  the  proof  of  Theorem  1,  we  see  that  XR  B H S 
in  distribution  as  n -*■  °°,  and  (13)  holds. 

(d)  The  technique  can  clearly  be  applied  to  Robbins- 
Munro  procedures,  and  to  other  algorithms  of  the  form  (4) , where 
f x ( • ) is  replaced  by  a suitable  alternative. 

The  technique  is  useful  for  getting  rates  of  convergence. 
This  topic  will  be  developed  in  a subsequent  paper,  for  various 
constrained  and  unconstrained  problems.  Even,  the  conditions  on 
f x ( • ) can  be  weakened,  provided  that  a proper  flow  can  be  defined 


for  x 
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RATES  OF  CONVERGENCE  FOR  SEQUENTIAL 
MONTE  CARLO  OPTIMIZATION  METHODS 

Harold  J.  Kushner 

Abstract 


Sequential  Monte  Carlo  Methods  of  the  Stochastic  Approxima- 
tion (SA)  type,  with  and  without  constraints,  are  discussed.  The 
rates  of  convergence  are  derived,  and  the  quantities  upon  which 
the  rates  depend,  are  discussed.  Let  {Xn>  denote  the  SA 
sequence  and  define  Un  = (n+l)^Xn  for  a suitable  6 > 0.  The 
{Un>  are  interpolated  into  a natural  continuous  time  process,  and 
weak  convergence  theory  is  applied  to  develop  the  properties  of 
the  tails  of  the  sequence.  The  technique  has  a number  of 

advantages  over  past  approaches  - advantages  which  are  discussed  in 
the  paper.  It  gives  more  insight  (and  is  apparently  more  readily 
generalizable)  than  do  other  approaches  - and  suggests  ways  of 
improving  the  convergence.  The  particular  "dynamical"  nature  of  the 
approach  allows  one  to  say  more  about  the  "tail"  process  - and  to  do 
more  "decision"  (or  "control")  analysis  with  it. 
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RATES  OF  CONVERGENCE  FOR  SEQUENTIAL 
MONTE  CARLO  OPTIMIZATION  METHODS 


Harold  J.  Kushner 


1.  Introduction. 

The  subject  of  stochastic  approximation  (SA)  for  uncon- 
strained systems  has  been  well  developed  in  many  respects  over  the 
past  25  years.  See,  e.g.,  the  references  in  Wasan  [1],  and  also 
Ljung  [2,3].  The  treatment  of  SA  under  constraints  is  relatively 
recent;  see  Fabian  [4],  Kushner  [6],  Kushner  and  Gavin  [5], 

Kushner  and  Sanvicente  [7] , [8] , Kushner  and  Kelmanson  [9] , and 
Kushner  [10] . The  SA  problem  (with  or  without  constraints) 
occurs  when  wishes  to  choose  a parameter  x e R (Euclidean 
r-space)  of  a system  which  (at  least  locally)  minimizes  a scalar 
valued  performance  function  f(x)  (without  or  under  constraints), 
but  where  the  form  of  f(*)  is  unknown  (as  it  usually  is  in 
complex  control  problems) , and  where  only  noise  corrupted  measure- 
ments of  the  performance  can  be  made,  at  various  selected 
parameter  settings.  The  algorithms  give  a sequence  of  parameter 
values  {XnJ  which  converges  to  a local  minimum  in  some  statistical 
sense.  The  subject  is  a stochastic  Monte-Carlo  form  of  the  general 
computational  problem  of  non-linear  programming,  and  has  numerous 
applications  to  control  theory  and  practice  in  the  areas  of  optimizatiol 
identification  and  tracking. 

All  of  the  previous  works  on  the  constrained  problem  treat 
only  the  fact  of  convergence.  Here  we  give  results  on  rates  of 


j 


convergence,  and  obtain  some  new  results  for  the  unconstrained 
problem  also.  In  particular,  we  will  show  that  when  suitably 
scaled  and  interpolated,  the  "tail"  of  the  SA  process  converges 
weakly  to  a linear  diffusion  process.  The  scaling  and 
properties  of  the  diffusion  give  the  rates  of  convergence,  and 
much  interesting  additional  information  as  well.  Some  of  the 
advantages  will  be  made  clear  below. 

In  Section  2,  the  unconstrained  algorithm  is  introduced. 

Sections  3 and  4 introduce  a Lagrangian  and  augmented  penalty 
function  algorithms.  The  unconstrained  problem  is  further 
developed  in  Section  5,  and  the  theorem  stated.  Section  6 contains 
some  background  on  weak  convergence  of  a sequence  of  probability 
measures  on  certain  metric  spaces.  This  theory  is  a very 
natural  tool  for  analyzing  the  asymptotic  properties  of  the 
interpolation,  and  for  obtaining  useful  information  on  this  "tail". 

In  particular,  it  enables  us  to  exploit  the  statistical 
structure  of  the  tail,  to  enhance  the  rate  of  convergence.  The 
method  of  proof  is  new  in  SA,  and  seems  to  be  more  easily  generalizable 
(to  other  types  of  noise  sequences  and  to  other  types  of  constrained 
problems)  than  past  approaches.  Relatively  little  is  known  concerning 
the  statistical  properties  of  SA  sequences,  considered  as  a process. 

Our  approach  seems  to  be  a useful  tool  for  dealing  with  such  properties, 
for  it  emphasizes  the  "process"  aspects  of  the  problem.  See,  also  the 
remarks  after  the  statement  of  Theorem  5.1. 

2 . The  Unconstrained  Case.  Formulation. 

1 A.  U 

Let  Xn  (with  components  X^,...,XN)  denote  the  n 
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estimate  of  the  local  minimum,  and  let  e^  denote  the  unit 

vector  in  the  itla  coordinate  direction,  and  let  {an,cn}  be 

null  sequences,  an  being  a positive  definite  matrix,  and  c 

a (finite  difference  interval)  positive  scalar.  Define  the 

"observation  difference"  6Yn  = { 6Y^, . . . , 6Y^}  and  the 

observation  noises  jS1'2  by 

n n J 


5Y1  = (observation  at  parameter  (X„  + e.c  ) ) - 
n n i n 

(observation  at  parameter  (Xn  - e^c^  ) 

- [f(xn+eicn'  + 4'1)  * <f<Veicn>  + ^ 


Let  ? n = ? n '2  and-  ?n  = (Cn,**°'Cn)'  and  let  &n  denote 

the  a-algebra  determined  by  XQ, . . . ,Xn,£0, * * * '^n-1* 

Define  Xn+1  by  (5fn  = <«fj <>.  Sfj  = f<Vei°n> 


(2.1) 


Xn+1  = Xn  - 


n 

Tc 


6f 


<5Y  = X_  - a_{ 


n 


K 


n 


n 


n 
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The  w.p.l,  convergence  of  {Xn}  (when  there  is  only  one 
stationary  point  of  f(*))  has  been  the  subject  of  most  of  the 
references  in  Wasan  [1],  and  we  mention  only  some  of  the  conditions 
usually  assumed,  namely: 


j 

i 


4. 


(2.3) 


la  = 00 

L n 
n 


(2.4) 


l a c < °° , I a /c  < 
“ n n “ 1 n'  n 

n n 


The  conditions  were  considerably  relaxed  in  Ljung  [2]  and  in 
Kushner  [10],  although  they  required  more  smoothness  on  f(«) 
than  the  previous  works  did.  Also,  [10]  proved  convergence  to  a 
stationary  point  of  f(*),  even  when  not  unique.  The  conditions 
required  here  will  be  given  later. 

By  simple  alterations  in  the  calculations,  it  is  possible 
to  treat  non-central  difference  and  continuous  time  forms. 

3.  The  Constrained  Problem.  A Laqrangian  Method. 

Now,  suppose  that  we  wish  to  modify  the  problem  so  that 
f(x)  is  minimized  under  constraints  qi(x)  < 0,  i = l,...,s,  where  each 
qi(*)  is  continuously  differentiable.  Let  Q(x)  denote  the 
matrix  {q.  (x),...,q  (x)},  where  q.  (•)  is  the  gradient 

1 / A Of  X lrX 

and  let  {b^} , i = l,...,s,  denote  positive  null 
sequences  and  let  A = (A1,...,!53),  X1  > 0.  Consider  the  Lagrangian 
algorithm. 


(3.1)  A*+1  = max[0,X^+b^qi (Xn) ] , i = l,...,s. 


(3‘2)  Xn+1  - Xn  - an{IiT  + 

n 


e>  •*.  ••  > 
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Suppose  that  there  is  a known  number  M such  that  the 
constrained  minimum  9,  and  the  corresponding  multipliers  X 
satisfy  | 0 1 | < M,  X1  < M.  Modify  (3.1),  (3.2)  by  projecting  on 

to  the  sets  (x:  lx1!  < M},  {X:  X1  < M},  whenever  the  bounds  are 
exceeded.  Then,  under  conditions  (2.2)  - (2.4),  and  convexity 
conditions  on  f(*)  and  q(*)»  [7]  proved  that  {Xn}  converges 
w.p.l.  to  the  constrained  minimum  9. 

It  was  not  proved  that  Xr  converged  to  an  optimal  multiplier. 
Indeed,  the  optimal  multiplier  may  not  be  unique.  Yet,  let  us  note 
for  later  use,  that  in  very  many  of  the  examples  which  we  simulated, 
it  appeared  that  Xn  did  converge  to  a X such  that  the  Kuhn-Tucker 
condition 

(3.3)  f x (9 ) + Q(9)X  = 0 

held. 

4.  Equality  Constraints.  An  Augmented  Penalty  Function  Method. 

Suppose  now  that  we  wish  to  minimize  f(*)  subject  to 

equality  constraints  <p^(x)  = 0,  i = l,...,s,  where  ^i(#)  are 

continuously  differentiable.  An  SA  version  of  Mieles'  [11] 

augmented  penalty  function  method  was  developed  in  (9).  Let 

1 2 

0 < k denote  a real  number,  define  P(x)  = j J | <t> ^ ( x ) | and 
$ (x)  = {<f>  „ Cx  ),...,<J>  (•))'.  Let  tt  (x)  denote  the  operator: 

X f X S / X 
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j* 

(I-tt(x))v  is  the  projection  of  v e R onto  the  span  of 

{<t>i  „ (x)  , . . . ,4>  (•)).  The  necessary  condition  of  the  calculus  for 

i f ^ S f X 

a local  stationary  point  at  x is  iT(x)fx(x)  = 0. 

Define  the  algorithm  (Px(x)  = <J> 1 (x)  <}>  (x) ) 


(4.1) 


W * Xn  - an[*(V  2ST+ 


Under  essentially  the  conditions  (2.2)  - (2.4),  bounded  double 
differentiability  of  f(*),  and  that  $'(x)<p(x)  = 0 implies  that 
<Mx)  is  of  full  rank,  reference  [9]  proved  convergence  w.p.l  to  a 
9 such  that  TT(0)fx(e)  = 0. 


5.  Unconstrained  Problem.  Theorem  Statement. 

Return  to  the  algorithm  of  Section  2.  Let  an  = A/(n+l)a, 
cn  = C/(n+l)\  where  C,a  and  y are  positive  real  numbers,  a > y , 
and  A is  a positive  definite  matrix.  With  a bit  of  extra  complica- 
tion in  the  notation,  rather  general  an/cn  sequences  (which  do  not 
decrease  too  fast)  can  be  handled.  Let  us  list  the  following 
assumptions . 


(A5.1)  f(*)  is  continuous,  and  has  bounded  and  continuous 

mixed  second  derivatives. 


•#  > * 
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CA5.2) 


r 2 

^ an 
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< 00 
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CA5.3) 

(A5 . 4 ) 
(A5.5) 


y a = °°. 

L n 


n 


There  is  0 e R such  that  9 w.p.l. 


f(*)  has  continuous  third  derivatives  f (x)  at  x 

X ■ A • X i 

th  1 i i 

0.  Define  B(0)  = vector  whose  i component  is  this 
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third  derivative  divided  by  3! 


J 


U 


(A5.6)  E E = 0 w.p.l. 

-^n 


CA5.7)  There  is  a matrix  E(9)  such  that  E £' 

E (0)  w.p.l,  as  n 


(A5.8)  For  some  5 > 0,  < °°, 


\Un 


2+6 


< w.p.l,  all  n. 


CA5.9I) 


or 


Let  a = 1,  0 = a/3  = 2y.  Define  F(0)  = Jacobian 

matrix  of  f(*)  at  0.  Let  the  eigenvalues  of 
AF  (0 ) - 01  = have  positive  real  parts. 


+It  is  possible  to  treat  the  case  where  0 is  a random  variable. 
CA5.9)  limits  our  consideration  to  rates  for  the  sequences  which 
converge  to  a strict  local  minimum. 


A 
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(A5.9II)  Let  a < 1,  g = 2y  = a/3,  and  let  the  eigenvalues  of 
AFC0)  = K2  have  positive  real  parts. 

(A5.10)  fx(6)  = 0. 

Remark.  The  conditions  are  mostly  self-explanatory.  (A5.9)  implies 
that  0 is  a strict  local  minimum.  The  w.p.l  convergence  (A5.4) 
is  assumed,  because  we  are  concerned  with  rates  of  convergence.  The 
actual  convergence  is  proved  in  the  various  cited  references. 

Condition  CA5.6)  is  essentially  a classical  condition  in  the  subject. 

As  discussed  in  Section  11,  the  condition  can  be  readily  weakened 

provided  that  we  can  still  show  that  P{  I U I > N}  0 as  N ->  ® 

n - ' 

uniformly  in  n,  where  {Un>  is  defined  below.  Indeed,  the 
possibility  of  such  extensions  is  one  of  the  advantages  of  our 
approach. 


Let  £.  and  e.  denote  functions  whose  values  may 
i,n  i,n 

differ  from  usage  to  usage,  but  which  depend  on  Xn  and  cn,  and 
tend  to  zero  w.p.l,  as  n -*■  °°.  (In  Section  9,  they  may  also  depend 
on  ^n-)  Define  <5Xn  = Xn  - 0.  Then,  using  (A5.4,5,10),  (2.1)  can 
be  rewritten  in  the  form 


(.5.1) 


£n+l  = V ar.[Fl9ISXn+B£e>cn+ii,n0J+e2,n'Sxn]  - anV2<V 


The  next  step  is  to  scale  {6Xn>.  For  some  8 > 0 (to  be 
selected  below  - we  are  obviously  interested  in  the  largest 


6 for 
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which  the  process  {U  } makes  sense)  define  U = (n+l)B5X  . Then, 

n n n 

using  (n+2)^  = (n+1) ® (1+6/ (n+1)  + 0 (iy) ) , 


(5.2)  Un+1  - (I+BI/(n+l)-anF(e)-an£ln)Un 

- ^n(nil)6^B(e)  - an(n+l)e5n/2Cn  + ^ 


where 


en  - ‘"+1»6lE2,ncn  + 2^  ?„°  '?T»  > 


It  turns  out  that  all  limits  of  {Un>  or  of  the  interpolated  {Un> 
introduced  below  do  not  depend  on  the  (asymptotically  negligable  - for 
the  g to  be  selected)  {e^}  sequence.  To  slightly  simplify  the 
development,  we  will  drop  the  term  henceforth,  although  its  presence 
would  not  affect  any  of  the  subsequent  arguments  - except  that  an  * 
additional  term  would  have  to  be  carried. 


Interpolation.  Introduction.  The  next  step  in  the  formulation  of 
the  limit  theorem  involves  an  interpolation  of  (u^}  into  a continuous 
parameter  process.  The  form  of  the  interpolation  is  motivated  by 
the  following  observation.  Let  { denote  a sequence  of  (zero  mean) 
independent,  identically  distributed  (for  convenience  here)  random 


variables  with  unit  variance  and  e|i|j  < M < °°  for 


some  real 


y > 0,  M > 0,  and  D be  a matrix  whose  eigenvalues  have  positive  real 
parts.  For  each  small  A > 0,  define  the  sequence  {V^}  and 
function  V^(*)  by 
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v\,  = CI-AD)VA  + /E  <Pn,  VA  = X,  fixed, 

n+±  n n u 

and  VA(t)  = VA  in  [nA,nA+A).  Then  {VA(*)}  converges  in  several 
statistical  senses  to  the  process  solving 

dV  = -DVdt  + dW, 

where  W(*)  is  a Wiener  process. 

Interpolation.  Let  D[0,°°)  denote  the  space  of  real  valued  functions 

on  [0,°°)  which  are  right  continuous  and  have  left  hand  limits  at 
each  t.  Suppose  that  D[0,°°)  and  its  products  are  endowed  with 

the  Skorokhod  topology  (see  Billingsley  [16]  for  D[0,T],  Lindvall  [20] 

for  D[0,°°).)  We  mention  only  that  convergence  of  a sequence 
(xnC-)}  to  a continuous  x(*)  in  that  topology  is  equivalent  to 
uniform  convergence  on  each  finite  interval,  and  that,  under  that 
topology,  the  space  is  equivalent  to  a complete  separable  metric 
space,  in  that  there  is  a metric,  generating  the  same  topology, 
under  which  the  space  is  complete  and  separable  (which  we  suppose 
henceforth) . 

n-l 

Define  At  = (n+l)“a,  t = l At.  , t = 0,  <5Wn  = (n+l)B  Y_a£ 

l” u 

= (n+l)^+Y  a//2  (5n/Atn)  , Wn  = £ <5W^ , W 0 = 0.  For  each  integer 

n,N,  define  *£  - «N+n  - «N,  - 6WN+n,  = V and  define 

UN(-),  WNC*)  by: 


u (t)  = UN+n,  W (t)  = WN+n  - wN  - wn  uu  i-N+n  -N^N+n+i  N 

Note  that  an^n  (n+1)  ® /2  cn  = (A/2  C)6Wn  (1+n)  Y+S  a//2  and 

a (n+l)Bc2B(0)  = (AB (0 ) C2 ) At  (n+1) B_2Y . Also,  the  paths  of  ( • ) 
n n n 

and  UN(*)  are  in  Dr[0,°°). 

Dropping  the  anen  term,  we  have 

(5.3)  Un+1  = GnUn  - (A/2C)/HnCn(n+l)Y+6'a/2 

- (AB  (0 ) C2 ) AtR (n+1 ) ^-2y , 

where 


G = (1+01/ (n+1)  - AAt  (F  (0 ) + e,  _)). 
n n 11 

It  is  clear  from  (5*3)  that  unless  y + 6 - a/2  < 0, 

2 

0 - 2y  < 0,  E|U  | will  diverge.  We  use,  henceforth,  the 

maximum  0,  namely  0 = 2y  = a/3,  with  which  the  exponents  of 

(n+1)  in  (5.3)  are  all  zero. 

If 

Let  W ( • ) denote  a standard  R valued  Wiener  process  and 
U(-)  the  (stationary  process)  solution  to 

(5.4)  dU(t)  = -KU(t)dt  - AB(0)C2dt  - (A/2C)  £1//2  (0  )dW  (t)  , 

where  K = ^ or  K2  (see  A5.9). 

The  undefined  terms  (concerning  weak  convergence)  in 


- r 


Theorem  5.1  will  be  defined  in  the  next  section,  and  the  proof 
given  in  Section  7. 


Theorem  5.1.  Under  (A5.1)  to  (A5.10),  { UN  ( • ) , v/1  ( • ) } is  tight  on 
D2r[0,°°),  and  {UN(.)}  converges  weakly  to  the  U(«)  of_  (5.4). 
(I.e.,  any  weak  limit  has  the  probability  law  of  U(-)  on 
Dr[0,°°)  or  on  Cr  [0,oc)  . ) 

Remarks.  Clearly,  we  have  the  fastest  rate  of  convergence  when 

a = 1 and  A5.9I  holds.  The  optimal  normalization  scaling  6, 

and  asymptotic  normalized  variance  (lim  EU(t)U'(t))  are  not  new; 

t-*-°° 

see  Sacks  [12]  and  Fabian  [13],  at  least  for  this  unconstrained 
problem.  However,  our  framework  is  interesting  for  several  reasons 

The  technique  emphasizes  the  behavior  of  the  "tail"  of  (U  } 

n 

considered  as  a dynamical  process.  The  correlation  structure  of 
this  process  can  sometimes  be  exploited  to  yield  (at  time  N)  a 
function  of  the  (X^,X  ^,...},  which  is  a better  estimate  of  0. 

See  Section  8.  In  practice,  we  observe  the  path  "dynamically", 
and  it  is  worthwhile  to  try  to  understand  its  dynamical  behavior. 

In  certain  cases,  e.g.,  the  Lagrangian  method,  the  procedure  often 
inherently  oscillates  around  0,X,  as  it  converges.  The  properties 
of  this  oscillation  can  be  deduced  from  the  relevant  results  of 
Section  9,  and  used  to  improve  the  estimate,  or  to  design  a more 
suitable  process.  The  "constrained"  results  are  new  and  rather 
interesting. 

Perhaps  the  dynamical  structure  can  lead  to  worthwhile 
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results  on  optimal  stopping  times  - or  to  facilitate  the  use  of  SA 
in  control  processes,  where  the  interest  is  often  inherently 
dynamical.  The  proof  has  an  independent  interest.  In  particular, 
in  the  unnormalized  case  (which  also  used  weak  convergence  methods 
and  a "dynamical"  approach)  [10  ] , we  were  able  to  prove  convergence 
for  a much  wider  and  more  realistic  class  of  noise  processes  than 
those  usually  considered, and  it  is  quite  likely  that  the  "rate" 
proofs  can  be  extended  to  the  wider  types  of  noise  sequences,  (see 
Section  11  for  a partial  such  result) . 

Extensions . By  a slight  change  in  the  method  of  proof,  we  can  also 
get  the  extensions: 

I.  Assume  the  conditions  of  Theorem  5.1,  but  set 
8 = min[2y,a/3].  If  2y  < a/3,  (resp.  2y  > a/3)  then  noise 
(resp. , bias)  is  relatively  unimportant  in  the  limit  and  the  theorem 
holds  with  dw  (resp.,  B(0))  set  equal  to  zero.  If  a = 1,  and 
0 < b < 8 and  the  eigenvalues  of  [AF(0)  - bl]  have  positive  real 
parts,  then  the  interpolation  of  { (n+1)  (Xn~0 ) } converges  weakly  to 

the  zero  process. 

II.  One-sided  differences.  Use  the  observation  difference 

[observation  at  (X  +c  e.)  - observation  at  X 1/c  instead  of 

n n l n n 

6Y*.  Then  the  theorem  holds  if  A/2C,  B. (0)  and  8 = 2y  = a/3 
n x 

(or  8 = min [ 2y , a/3 ] in  I above)  are  replaced  by  A/fc  , f v (0)/2 

l i 

and  8 = y = a/4  (or  8 = min[y,a/4]  in  I above),  resp. 

Theorems  9.1  and  10.1  can  also  be  extended  in  the  same  way. 


V * ^ 
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6 . Weak  Convergence. 

The  material  is  in  Billingsley  [16],  See,  also  Whitt  [17], 
Kushner  [18,  Chapter  2],  Iglehart  [19]  or  Lindvall  [20],  who 
gives  the  extensions  of  weak  convergence  on  D[0,T]  (for  some  real  T) 
to  that  on  D[0,°°).  Let  {Zn}  denote  a sequence  of  random  variables 
with  values  in  a complete  separable  metric  space  S (such  as 
rl  r2  r2 

D [ 0 ,°°)  , or  D 1 0 , 00 ) x R or  R , for  some  integers  r^#r2)» 
with  induced  measures  { Pn } on  the  Borel  sets  S/  of  S.  (On 
D [ 0 , °°)  , Sf  is  also  the  Borel  algebra  over  the  coordinate  pro- 
jections; i.e.  over  the  sets  {x(*):  x(t)  e A = real  Borel}.) 

The  {Pn}  sequence  [and  also  {Zn},  here)  is  said  to  be 
tight  if  for  each  e > 0,  there  is  a compact  e Sf  such  that 

Pn  (Ke ) > 1 - e,  all  n.  If  (P  } is  tight,  then  any  subsequence 
has  a further  subsequence  which  converges  weakly  to  some  measure  P 
on  CS,_90  . If  (Pn}  converges  weakly  to  P,  then 


f (x)  P (dx) 


f (x)P (dx) 


14. 


for  every  bounded  measurable  f ( - ) which  is  continuous  on  a 


measurable  set 

SQ  C S,  such  that 

P(S0)  = 

1. 

If 

(S,50 

x S2,  x 

5^) , and  p"  is  a 

measure 

on 

(s., 

i = 1,2,  all  n 

, then  tightness  of 

(Pj  x 

Pn} 

2* 

on 

(S,^) 

implied  by  tightness  of  {P^}  on  (S^  , for  each  i. 

If  Pn  -*■  P weakly,  and  if  Z is  a S valued  random  variable 
with  values  in  {S,SS)  and  with  measure  P,  we  say  that  Zn  -+  Z 

weakly  also.  In  this  sense,  Theorem  5.1  is  understood  to  mean 

that  the  measures  of  the  Dr[0,°°)  valued  random  variables 

N Nr 

U (•)  (or  the  measures  that  U (s)  , s < 00  induce  on  D [0,°°)) 

converge  weakly  to  the  measure  that  U(*)  induces  on  Dr[0,°°). 

In  our  case,  Zn  will  be  identified  with  some  combination  of 

Un(*)»Wn(*)  and  Un.  Also,  it  will  be  proved  that 

{ Un  ( - ) , w11  ( - ) , } is  tight  on  D2r[0,°°)  x Rr. 

Let  Zn  ( • ) be  a sequence  of  processes  with  paths  in 

D [ 0 ,°°)  = S,  w.p.l,  and  induced  measures  Pn  on  ( S,J S’). 

Then,  there  is  tightness  of  { Pn } or  { Zn  C * ) } on 

D[0,»),  if  the  restrictions  to  D[0,Tk]  are  tight  for  some 

sequence  T^  °°.  We  sometimes  use  (without  explicit  mention) 

the  fact  (and  similar  facts)  that  if  E max|zn(t)|  -►  0 as 

t<T 

n -*■  <»  for  each  T,  and  Zn(*)  e D[0,°°)  w.p.l,  then  Zn(*) 

converges  weakly  to  the  zero  element  of  D[0,°°).  We  note  for 

future  use,  that  if  h(*,*)  is  a bounded  continuous  function, 

and  Zn(«)  -»■  Z (• ) weakly,  where  Z(*)  has  continuous  paths, 

t n 

h(t,s)Z  (s)ds  converge  weakly 

0 


then  the  processes  defined  by 


to  the  process  defined  by 


[ h(t,s)  z (s)ds. 
J 0 


7 . Proof  of  Theorem  5.1. 

1°.  Returning  to  (5.3)  and  using  3 = 2y  = a/3,  we  first 
show  that  {U^}  tight  on  Rr.  Define  (neglecting  an^n» 

as  we  can  easily  show  is  legitimite)  {vn>  and  Un  by 


n+1 


= G v 
n n 


- AB(0)C‘ 


At  , 
n' 


u = 

n 


U 


- v 


n 


Note  that,  if  random  functions  Z and  ZG  take  values 

n n 

in  the  same  space  for  each  e > 0,  and  differ  on  at  most  an  w 

set  of  measure  e,  and  if  {Z^}  is  tight  on  some  space  for  each 

e > 0,  then  { Z > is  tight.  Thus,  since  7 ->  0 w.p.l,  if 

^ i f n 

tightness  (and  the  theorem)  is  proved  under  the  assumption  that  for  each 

e > 0,  there  is  an  integer  N < » such  that  17,  I < e for 

e l , n 1 - 

n > N£,  and  lG^/nl  is  uniformly  bounded,  then  they  will  be 

true  in  general.  We  make  the  assumption  on  7,  . Under  this 

1 , n 

assumption  Cvn}  is  bounded,  hence  tight  on  Rr.  We  only  need 
to  prove  that  (Un)  is  tight.  We  have 

Vl  - Gn°n  - lA/2c'  «„• 

By  (A5.6),  (A5.8),  and  under  cases  (A5.9I)  or  (A5.9II),  there  are 
positive  constants  MX,M2  such  that,  for  large  n. 


* m. 

J 

.1 

1 


Xl5"+1'2  S 'Gn|2,5nl2  + MlAtn 


< U-M2itn)|Sn|2  + Ml4tn, 


<w  2 

from  which  boundedness  of  { E | Un | },  hence  tightness  of  (Un } 
follows. + 


2°.  A representation  for  UN(*)„  Define  C?  = I for 
i > j and  C?  = G^  ...  Gi  for  i < j.  Then  (5.3)  is  solved  to 


M+n  N+n  M 2 

Ul  = Si  % ' LCm+l((A/2C)'5Wn>  + AB  <S  1 c 4tm1' 

m=N 


or,  equivalently,  Ca  more  convenient  form  for  us) 


Wi  - - lCI(a/2c,(9Vi-wn)  - <w> 


+ ®»>c  «wv  - (w» 


For  the  moment,  let  us  consider  only  the  sum  in  (7.1)  in- 
volving the  W^.  Denoting  that  sum  by  lj^+n  and  rewriting  it 
by  collecting  the  coefficients  of  each  VT,  yields 

+It  is  precisely  the  difficulty  of  proving  tightness  of  {Un) 

when  (A5.6)  is  relaxed,  that  forces  us  to  require  (A5.6).  See 
the  last  Section,  where  relaxations  of  the  condition  are  discussed. 


J w 
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'C"*  ! C‘wUV2cu',"‘V  + (A/2C)  Vrt1  • 

m=N 

Using  C^+n  - = -c!lT?  (KAt  + e.  At  ), where  K = K or  K 

3 m m+1  m+1  m 3,m  m 1 z 

according  to  the  case  of  (A5.9),  yields  I^+n  = l!?+n  + 


N 


?N+n 


+ In  + CA/2C)  CWN+n+rV'  where 


-N+n  _ Nrn„N+n  ,„iriv  -1 , 


In  = I<S™(  c-,  *f*Vtt/2C)(VV 


m=N 

*N+n  _ N+n  ,„m.-l 


N+n 


m=N 


Now,  for  each  N,  define  CN(*)  on  f0,“)  by 


N+n 

CN^  = CN  on  ^N+n^N'  tN+n+l~tN+n) * 


Since  \ (.At  ) ->  0 as  N -*■  °°,  we  have  that  CN(t)  “ exP  “ Kt» 

n=N  n 

as  N -*■  °°,  uniformly  (w.p.l)  on  finite  time  intervals. 

A *" 

For  each  N,  define  the  interpolations  IN  ^ ^ ' ^"n  ^ ' IN  ^ 

_ . r _N+n  , $N+n  r ~N+n  ■,  . 

of  the  sequences  IIN  },  {IN  ),  tIN  ),  by  e.g.. 


INtt)  IN  °n  ftN+n"tN,tN+n+l"tN) * 


It  is  not  hard  to  show  that  {!(•)}  is  tight  on  Dr[0,<=°)  and 
goes  to  the  "zero"  process  weakly,  as  N -*■  °°.  Thus,  we  ignore 
it  henceforth.  Define 


i| 

I 
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1 


J 


JNCt)  - - |^CNCt")c”^  Cs)K(A/2C)W** (s)ds  + (K/2C)\P{t) 

Cexp  - K(t-s)  ^(A/WW^sJds  + (A/2C)WN(t) 


0 

ft 


It  is  also  not  hard  to  show  that 


(JN(t)  " {JN(t|  ’ IN(t)) 


J 


are  each  tight  on  D [0,°°)  and  tend  to  the  "zero"  processes  weakly 

as  N -*  <».  A similar  development  can  be  made  for  the  sum  in 

(7.1)  which  involves  the  (t  -t„) . Putting  all  the  foregoing 

m N 

together  and  adding  the  neglected  terms,  we  have 


(7.2) 


UN(t)  » (exp  - Kt)UN(0) 


+ f (exp  - K(t-s)  )K(A/2C)Wl,(s)ds 
J 0 

- (A/2C) (t ) 

r (exp  - K(t-s)  )£(AB(0)C2)s  ds  - (AB(6)C2)t 


N 


+ eN(t), 


where  e^C*)  a Process  which  tends  to  the  zero  process  weakly, 

as  N °°.  Note  that  if  a subsequence  of  (UN(0),  W1*  ( • ) } 

converges  weakly,  so  does  the  subsequence  of  {U  (•)),  and  the 
limit  process  has  continuous  paths  w.p.l. 


Kfh*  •* 


**  t 


j 
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3°.  Suppose  that  W**  (* ) were  tight  on  Dr[0,°°)  and 
that  any  weak  limit  is  a Wiener  process  with  covariance  E ( 0 ) t . 
Then  the  limit  of  {U  (•)}  corresponding  to  any  weakly  con- 
vergent subsequence  of  {W1^  (.* ) ,UN  (0)  } is  equivalent  (in  law  on 
Dr[0,<»))  to  the  process  given  by  (7.2)  with  UN(0),WN(*) 
replaced  by  some  U(0),W(*),  where  W(*)  is  a Wiener  process 
with  covariance  E(6).  The  theorem  follows  from  this,  since  the 
resulting  right  hand  side  of  (7.2)  solves  (5.4)  (as  an  integration 
of  (7.2)  by  parts  will  show). 


4°.  Thus,  we  only  need  to  prove  the  first  sentence  of  3°. 
We  use  Theorem  2 of  Scott  [14]  with  an  appropriate  change  of 
notation.  Scott's  result  is  the  following. 

Let  {v^}  denote  an  array  of  scalar  valued  random 
variables  where  v^,  i < n,  are  measurable,  for  some 

sequence  of  a-algebras  which  is  non-decreasing  in  n,  for 


each  N.  Let  E be  a positive  real  number.  Let 

i 

M 

E 


5 Mv?  = 0 w.p.l,  and  define  m_(t)  = max{i:  E £ (v1!1)^  < t£} 

N 3=0  3 


N,  2 


t e [0,°°)  , all  N.  Define 


(7.3) 


mN(t) 

At)  = l vj. 

i=0 


Let (condition  B of  Scott's  Theorem  2; 
probability) 


is  convergence  in 


t -Me***. 
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"Vi  Ct) 

a. 41  I E{  Cv*?)2|  -£-»*  It, 

i=0  1 X 

C7.5)  l E{  Cv^)  2I  — - — > 0,  all  e > 0, 

1=0  1 {|vj|>e}  i 

as  N -*■  °°,  for  each  t < «>.  Then  { ( • ) } is  tight  on  D[0,«>), 
and  the  limit  of  any  weakly  convergent  subsequence  is  a Wiener 
process  with  covariance  Et  and  mean  zero.  Actually  Scott  deals 

with  D [ 0 , T]  and  E = 1,  but  his  theorem  is  valid  on  D[0,°°)  also 
(and  for  any  E > 0,  by  a simple  scaling). 

Assume  that  E(0)  ? 0,  for  otherwise  the  result  is  trivially 
true.  Let  X e Rr  be  such  that  X'E(0)X  > 0.  Identify  with 

^N+i'  Vi  w^t^1  X’6W^,  and  E with  X'E(0)A.  Note  that  the 

convergence  Xn  -*■  0 and  (A5.7)  imply  (7.4).  Condition  (A5.8) 
implies  (7.5).  Thus,  by  Scott's  theorem  (V*1^))  is  tight  and 
converges  weakly  to  a Wiener  process  with  mean  zero  and  covariance 
(X'E(0)X)t,  as  N -*■  °°.  Note  that 

E X^2  • E 

and  E^N+j^^+j  E(0)  as  N,  j -*■  «>.  Thus,  the  result  remains 

true  if  El^'^N+j^2  = E(^j)2  is  replaced  by  X'E(0)X  in  the 

N 

definition  of  m^t).  With  this  change,  the  definition  of  V (•) 
is  the  same  as  that  of  X'W^(*);  thus  A’W^(*)  converges  weakly 
to  a Wiener  process  with  mean  zero  and  covariance  X'E(8)X. 

Since  the  result  of  the  last  paragraph  is  true  for  each  X 


21. 


(if  l'E(0)X  = 0,  then  the  W1^  C* ) converge  to  the  "zero"  process), 
we  can  conclude  that  is  tight  on  Dr[0,°°),  and  that  it 

converges  weakly  to  the  desired  R valued  Wiener  process,  with 
covariance  £C9),  degenerate  or  not.  Q.E.D. 

8 . Exploitation  of  the  "Correlation  Structure"  of 
the  Limit  Process. 

In  order  to  illustrate  a possible  useful  application  of  the 
correlation  properties  of  (5.4),  consider  a simple  scalar  case  with 
B(0)  =0,  2C  = 1 , 0 = a/3  . Write  F = F (0 ) , Z (0)  = a2;  let 
a = 1 and  let  K = AF  - 3 > 0.  Then  (in  steady  state) 

(8.1)  EU(t)  = 0,  EU (t) U ( t+s ) = V exp  - Ks,  s > 0, 

V = A2cr2/2  (AF- 8 ) . 

V is  minimized  by  A = 2 g/F  = Ag. 

In  practice,  F is  not  usually  known.  In  order  to  reduce 
the  sensitivity  of  the  iterate  sequence  (Xn>  to  "large  initial 
noises" , and  to  partially  rectify  the  slow  convergence  that  occurs 
when  A is  too  small,  it  is  common  for  an  A > Ag  to  be  used 
(where,  of  course,  F must  be  guessed) . For  similar  reasons,  an 
a < 1 is  sometimes  used,  and  what  follows  also  holds  for  the 
a < 1 case  (in  which  case  the  correlation  (8.1)  uses  3=0  and 
K = AF) . As  A increase,  K increases,  and  the  process 

O 

{ (n+1)  (X  -0) } = (U  } behaves  more  "wildly".  We  will  try  to  exploit  this. 


22. 


Let  be  [0,11,  0 < t < ~ and  define  qN(t)  = 
i 

= max{i:  l Atj+N  < t) , qN(0)  = 0.  Using  the  fact  that 


3=0 


n^/3 (X  -6)  — > N(0,V),  together  with  the  correlation  structure 


of  (8.1)  yields,  for  large  N, 


(8.2) 


E[btxN-e)  + U-b)  (xN+qu tt)  - 6)] 


N 


vr  4.  2b(l-b)exp-Kt  . (1-b)l 2  , 

,fnV3  + ,fn2/3] 


N ' (N+q^ (t ) ) 


(N+qN(t)  ) 


C(8.2)  holds  as  a limit  relation  if  we  multiply  both  sides  by 
SI2^3,  and  let  N ■+•  °°. ) 

By  definition  of  qN(*), 


N+qN Ct) 


l At^  % t % log[N+qN(t)]  - log  N, 
i=N 


and  e_t  (N+qN  (t)  )/N  -*■  1,  as  N -*■  <*>.  Then  (8.2)  is  approximately 


(the  ratios  tend  to  unity  as  N ■*  °°) 


(8.3) 


N 


173 


[b  +2b(l-b)exp-(K+l/3)t  + (1-b)  exp-2t/3] 


At  b = 0,  the  derivative  of  (8.3)  with  respect  to  b is 


(8.4) 


2V 


N 


173 


(exp-t/3) [exp-Kt-exp-t/3] 


■’*  ■ t 


con 


ditions  Cat  least  from  an  "asymptotic"  point  of  view) , the  linear 


combination  inside  (.8,2)  does  not  yield  an  improved  estimate.  But 


if  A > An , then  C8.4)  is  < 0,  suggesting  that  we  can  improve  the 


combination  of  past  iterates.  Such  ideas  have  a natural  appeal 


and  such  smoothing  is  sometimes  used  in  practice,  irrespective  of 


the  value  of  A;  but,  we  see  that  it  can  be  harmful,  unless  A > A 


(or  a < 1) , and  the  weights  are  carefully  selected 


An  open,  and  interesting,  question  in  the  general  vector 


case  is  whether  A 


0 w.p.l),  but  such  that  K has  some  complex  eigenvalues 


sequence  would  then  exhibit  "oscillations"  on  the 


which  perhaps,  could  be  exploited  (via  suitable  smoothing 


Calculations  made  on  simple  sample  problems  indicate  that 


complex  eigenvalues  (often  corresponding  to  the  eigenvalues  with 


algorithm  of  Sections  3 and  9.  Perhaps  smoothing  can  be  usefully 


It  should  be  clear  from  the  foregoing,  that  our  inter 


poxated  process"  analysis  of  { Xn } and  (Un>  is  rather  natural 
and  that  it  can  yield  much  more  insight  into  the  properties  of 
the  sequences,  and  smoothed  functionals  of  the  sequence,  than 
can  the  more  standard  analysis  of  only  the  random  variables 


■mptotic  Rate  for  the  Lagrangian  Method  of  Section  3. 

The  development  is  close  to  that  for  Theorem  5.1,  and  only 


an  outline  will  be  given 


(A9.1) 


Assumptions.  Assume  (A5.1)  to  (A5.8)  and 

There  is  a X such  that  X X w.p.l. 

n 

(A9.2)  If  q^(0)  = 0,  then  suppose  that  X1  > 0.  (Of  course,  if 
q^(9)  < 0,  then  X1  = 0.) 

(A9.3)  Each  q^(-)  has  continuous  and  bounded  first  and  second 

derivatives  at  0. 

(A9.4)  Q(0)  (see  Sec.  3)  is  of  full  rank. 

Let  an>cn  ke  as  Section  5,  and  let 
bn  = diag Cb1, . . . ,bs)/ (n+1) a = diag  (b^, . . . ,b®) , where  b^  > 0, 
i < s.  Define 

Qi(0)  = (32qi  (x)/3xkax-^ , k,j  = l,...,r}  at  x = 0, 

1 lf**«fS* 

Define  KQ  = A (F (0 ) + £ X1Q± (0 ) ) and 

KQ  AQ(0j~ 

-BQ ' (0)  0 

(A9.5)  Let  a < 1 (a  = 1,  resp.)  and  let  the  eigenvalues  of 
i<2  (Kf  = K ^ - 6 1 / resp.)  have  positive  real  parts. 

(A9.6)  Let  0 satisfy  the  Kuhn-Tucker  condition 

fx(0)  + l lic3ifX(0)  = 0. 


■*  T 
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Remark.  The  A^  -*  A convergence  was  not  proved  in  [7]  , but  it 
appears  from  our  simulations  that  convergence  of  {A  } occurs  quite 
frequently.  (A9.2)  is  often  assumed  in  the  deterministic  case;  it 
simply  says  that  the  "economic  price"  of  an  "active"  resource  is 
positive  at  the  optimal  point. 

If  qiC0)  < 0,  then  qi(Xn)  qi(0)  < 0 and  (3.1),  the 

convergence  Xr  -*•  0 , and  divergence  of  £ b together  imply  that 

i n 

An  = 0 for  all  large  n.  We  will  henceforth  ignore  q^  if 

qi(0)  < 0.  We  can  and  will  assume  that  all  s constraints  are 

active  at  0 - with  no  loss  of  generality.  The  linear  independence 

of  the  q.  (0)  is  also  commonly  assumed  in  the  analysis  of 

deterministic  algorithms. 

Tf  Kq  is  positive  definite  and  A and  B are  diagonal  with 
the  same  constant  diagonal  elements,  then  (using  A9.4)  Polyak  [21, 
proof  of  Theorem  1]  implies  (A9.5). 


Development  of  the  Algorithm.  Owing  to  the  remarks  above  and 


since  we  are  only  concerned  with  large  n,  we  can  write  (3.1) 
as  A^+1  = + b^qjL(Xn),  i = l,...,s.  Define  6An  = An  - A. 

Then,  using  (A9.6),  we  can  write 


"t  r 


26 


<$X 

3A 


n+1 


n+1 


Xr  “ an«0+el,n>' 


VQ'(9>  + e3  ,n^ 


-a  (Q (0 ) + e,  ) 
n t , n 


’6Xn 

(9.1) 


a (B(9)c2+e.  c2) 

n n 4 ,n  n 

a 

V 

0 

n 

2cn 

.0  . 

where  I = identity  in  Rfc. 

The  terms  e . all  depend  only  on  X and  A and  tend 
i,n  n n 

to  0 w.p.l  as  n by  the  convergence  (of  {Xn,An>)  and 

smoothness  (on  f(*)»q(*))  assumptions. 


Define  U 

n 


(n+1)  6 


l nj 

(5.2)  was  obtained  from  (5.1), 

properties  of  the  e . above) 

l , n 


. Following  the  method  by  which 


we  get  (the  e . have  the 
i,n 


Un+!  " II  + 


(n+lT  1 ' (n+1>‘“(K2+el,n>}Un 


(9.2)  - (n+l)_a(n+l)0‘2Y 


AB(0)C2  + e2,n 


- (n+l)"a/2(n+l)"a/2+B+Y(A/2C) 


fCI 


n 


(1+0  (-))• 
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Define  At^  = (n+1)  a,  tR,  <$Wn(-)  and  ( • ) as  in 
N 

Section  5,  Let  U (•  ) denote  the  function  which  equals  11, , 

N+n 

°n  ^tN+n-tN,tN+n+l-tN^  (analogously  to  the  definition  of  WN ( • ) . 
Set  3 = 2y  = a/3.  Then  (9.2)  can  be  rewritten  as  (K  = or 

K2  according  to  whether  a = 1 or  a < 1) 


(9.3) 


Un+1  * [I  - AVK+E3,n)!0n 


n 


C 2\ 

AB  (0)C 

f<5W 

> - (A/2C ) 

n 

< 

0 

V.  J 

0 

+ At  e. 
n 4 ,n 


Let  W ( • ) denote  a standard  Rx  valued  Wiener  process, 
and  let  U(*)  be  the  (stationary  process)  solution  to 


(9.4) 


AB(0)C' 

dU (t)  = -KU (t ) dt  - i Q 
I"'  “ (0)dW(t) 


dt 


A 1/2 


2C 


Theorem  9.1.  Assume  (A9.1)  to  (A9.6),  (A5.1)  to  (A5.8),  and  let 
a^  = A/  (n+1) 01 , bn  = B/(n+l)a,  cr  = C/(n+l)Y,  where  C > 0, 

B is  diagonal  with  positive  elements,  and  A is  positive  definite. 
Let  6 = 2y  = a/3.  Then  { UN  ( • ) ,WN  ( • ) } is  tight  on  D2r+S[0,°°), 
and  any  weak  limit  of  the  { UN ( • ) } has  the  law  of  the  (stationary 


* ' t ’ 


process  solution)  U(*)  in  (9.4) 


Remarks.  The  proof  is  almost  the  same  as  that  of  Theorem  5.1 


and  is  omitted. 

Owing  to  the  presence  of  the  "multiplier  dynamics"  in  (9.4), 
it  is  more  likely  that  K ^ will  have  some  complex  eigenvalues.  The 
consequent  oscillations  (of  the  correlation  function)  should  be 
exploitable,  via  ideas  such  as  those  in  Section  7,  to  yield  a 
smoothed  sequence  of  estimators  which  are  better  than  {Xj.  Indeed, 
such  a possibility  justifies  our  point  of  view  concerning  the 
advantages  of  studying  weak  convergence  of  the  processes  U (•)  to 

U(*)f  over  the  simpler  convergence  in  distribution  of  the  Rr 
valued  sequence  {Un>. 

10.  Asymptotic  Rates  for  the  Equality  Constrained 
Algorithm  of  Section  4. 

We  will  need  the  assumptions: 


29. 


The  ij>^(  ) have  two  continuous  derivatives  at  0. 

(A10.2)  The  d>  i - i . , 

^i,xl  ' 1 1/.../S  (the  rows  of  <J>(0))  are 

linearly  independent. 


Let  v (y)  (with  components  v1 (y) , . . . , vr (y ) ) denote  the 


'r,x(^ 


lr 


Let 

(^x  (y)) 

(y)/3x1 

• 

• • • 3v^ 

• 

% 

(y)/9x1 

. . . 2vr 

(y) 

wri(yr 

• • • 

• 

(y) 

wrr <*1 

[w1(y),...,wr(y7] 


Define 


K0  = ^fx(9)  )x  + k<J>'  (0)$(0)  , 


(A10.3I)  Let  a 1,  B - 2y  - 1/3.  Let  the  eigenvalues  of 

K1  = A^o  ~ 21  have  Positive  real  parts. 


or 


See  the  discussion  after  Theorem  10.1,  and,  in  particular, 
the  representation  (10.8)  for  KQ. 


1 fl 

J 
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CA10.3IIJ  Let  a < I,  (3  = 2 y = a/3.  Let  the  eigenvalues  of 
K2  = AKq  have  positive  real  parts. 

(A10.4)  0 satisfies  the  necessary  condition  for  a 

constrained  minimum,  n C0 ) f x C9 ) = 0. 

Let  E (X  ) denote  the  covariance  (given  ® ) of  the 
7r  n J n 

projection  of  onto  the  orthogonal  complement  of  the  span  of 

(X  ),..., <p  (X  ) . We  use  the  terminology  of  Section  5, 

1 f x n s / x n 

except  that  (Xn>  is  given  by  (4.1).  Under  (A5.7),  and  the 
convergence  (A5. 4), there  is  a matrix  £^(0)  such  that 
cov[7r(Xn)Cnl  -*■  2 ^ C0 ) w.p.l,  as  n 


Theorem  10.1.  Assume  (A5.1)  to  (A5.8)  and  (A10.1)  to  (A10.4). 
Define  Un  = (n+1) ^ (Xn  - 0) . Then  {Un (• ) ,WN (• ) } is  tight  on 
D [0,°°),  and  there  is  a standard  Wiener  process  W(*)  such  that 
any  weak  limit  of  (U  (•)/  has  the  (stationary  process  solution) 
law  of  the  U ( • ) in  (10.1),  where  K = or  K2,  according  to 

the  case  of  A10.3. 

(10.1)  dU(t)  = -KU(t)dt-ATTC0)B(0)C2dt-(A/2C)Eir1/2(0)dW(t)  . 

Remark.  The  proof  is  very  close  to  that  of  Theorem  5.1  and  is 
omitted.  We  remark  only  on  the  expansion  of  (4.1),  and  on  the 
conditions.  The  e.  , e.  have  the  same  meaning  as  in  Section  5. 


With  X - 0 = 6X  , we  can  write 
n n 


(10.2) 


5Xn+l  = 5X„  - + B(9)cn  * El,ncn  + 


e2  n6Xn  + ?n/2cn}+  ' (6 ) $ (0 ) 6Xn  + e_  ^6X^] 

^»nn  nn  n 3 , n n 

Using  CA10.4)  and  the  smoothness  assumptions  on  f ( - ) and 
( • ) , we  get 


(10.3) 


’(WV  * (l£x(9,1x6Xn  + E4,n{X„- 


Now,  using  (10.2),  (10.3),  the  values  of  a,B,y  in  Al0.3((Case  I 
or  II),  and  a development  of  (n+2)6  such  as  used  in  Theorem  5.1, 
we  get  (analogously  to  the  unconstrained  case) 


(10.4; 


Un+1  = [I  ~ Atn(K+el,n)]Un  ” A^Att  (0  ) B (0  ) C‘ 
mrn  (A/2C)  rr(e)  en  - 4tnlT2,n+5n0<ff)  ) • 


Using  (10.4),  the  proof  goes  as  it  does  for  Theorem  5.1. 


Remark  on  (A10.3).  Without  loss  of  generality,  suppose  that  0=0. 
Let  T(y)  denote  the  tangent  line  or  hyperplane  to  the  curve  or 
surface  {x:  <p  (x)  = 0}  at  y,  and  TQ(y)  the  orthogonal  complement 
to  T(y).  In  general  T-(0)  D span{<)>.  (0),  i < s}.  We  make  the 

U 1 f X “ 

additional  assumption  that  T (0)  = span{<)>.  (0),  i < s}.  Let 

U 1 f X “ 
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X = (x  ,.,.,xr)  denote  the  generic  point  in  Rr.  With  no  loss  of 
generality  Cand  some  gain  in  insight) , we  can  assume  that  the  basis 
is  such  that  Xj_,. . . ,xg  and  xg+1,...,xr  form  bases  for  TQ(0) 
and  T(0),  resp.  Using  the  last  three  sentences  and  (A10.2),  we 
have  that  there  is  a non-singular  (s*s)  matrix  $ such  that 


CIO. 5) 


(0)  $ (0) 


•v  N 

J 

1 1 

i 

o 

1 

O 

1 

O 1 
1 

1 1 

c r 

Let  N denote  an  e-neighborhood  of  0 e R . There  are 

differentiable  functions  Jl  (• ) on  Rr  s such  that  if 

x e N H {x:<(>(x)  = 0}  = N^,  and  e is  small  enough,  then 

Xi  = &^(Xg_^^,»..,X^),  i ~ 1,  • • . ,S. 

Henceforth,  assume  that  e is  "sufficiently"  small. 

We  will  now  develop  a representation  for  (gfx(0))x*  Let 

6 denote  a small  real  number.  By  the  definition  of  the  tangent 

2 

plane  or  line  T(0),  A^Ce^S)  =0(6  ),  j = s + l,...,r.  Consequently, 


for  j > s,  the  smoothness  of  4>  ( • ) ,tt  (• ) and  fxC*)  implies  that 


(10.6)  irCe^  + )[  e^Ce^)  ) *fxCej6  + j e^le.j)) 


= Tr(e  .6)  f (e  .6)  + 0(6*)  . 
J X J 


By  the  definition  of  w^(0). 


(10.7)  lim  iT(e.6)f  (e.6)/S  = w.(0) 
6->-0  ixi  l 


Recall  that  ^(0)fx(0)  = 0.  Note  that  for  each  i,  the  vector 


Tr(6ei)  fx  (6e^)  = [projection  of  f^(6ei)  onto  the  tangent 
plane  T(6e^)] 


has  components  0(6)  on  T(0)  and  0(6  ) on  TQ(0)._Thus»  by  (10.7), 


w^(0)  is  in  T(0)  and,  hence,  has  the  form  w^(0)  = 


some  r - s vector  g . . 


for 


VJe  will  examine  further  the  term  F,  = [g  ,,...,g  ] 

<J)  ^s+1  ^r 

Define  the  function  F(*)  on  Ne  n T(0)  by 


f(xs+l',,,'xr)  “ f (£l(xs+l'***)"**'£s(xs+l,***)'xs+l'-**,xr) 
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Note  that  the  vector  of  the  last  r-s  components  of  the 

left  side  of  (10.6)  is  the  gradient  of  f"  ( • ) at  e.6 

2 

(modulo  a term  of  order  0(6  )).  This  fact,  together 

with  (10.6)  and  (10.7)  imply  that  F,  is  the  Hessian 

<J> 

matrix  of  f(«)  at  0.  Finally,  KQ  takes  the  form 


(10.8) 


K0  = 


k$> ' <J> 


9i  f • • • f9c 


Thus  (A10.3II)  holds  if  the  eigenvalues  of  the  Hessian  F^ 
have  strictly  positive  real  parts  and  A = diagonal (a, a, ...) , 
a > 0.  In  any  case,  the  representation  (10.8)  clarifies  the 
meaning  of  the  condition  (A10.3). 


Remark  of  the  value  of  k.  If  a = 1,  we  require  that  the  real 
parts  of  the  eigenvalues  of 


AKq  - BI 

be  positive.  This  requires  that  k > some  minimum  positive  value. 
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V 


vpv — m 


ippp — ■ m 


Certain  deterministic  algorithms  [15]  also  require  a minimum  value 
of  k,  for  convergence.  Here,  convergence  occurs  for  any  k.  But, 
if  a = 1,  and  k is  too  small,  the  rate  (3)  will  not  be  a/3  = 1/3, 
but  something  less,  something  which  also  seemed  to  hold  in  our 
experiments . 


11 . Extensions. 

Assumption  (A5.6)  was  used,  because  we  were  not  otherwise 
able  to  prove  tightness  of  {Un}  in  any  generality.  Theorem  11.1 
follows  from  the  proof  of  the  previous  theorems.  It  is  much  less 
difficult  to  find  reasonable  conditions  under  which  W11  ( • ) converges 
weakly  to  a Wiener  process.  After  the  theorem  statement,  we  will 
comment  on  this. 


Theorem  11.1.  Assume  the  conditions  of  Theorem  5.1  (or  of  (9.1, 
10.1),  except  for  (A5.6),  and  suppose  that  (Un,Wn(*)}  are  tight  , 
and  {W  (•)}  converges  weakly  to  a Wiener  process  with  mean  0, 
and  covariance  Eq(0).  Then  the  conclusions  of  the  theorems  still 
hold. 


Remarks . Our  remarks  will  be  confined  to  the  unconstrained  case, 

for  the  others  are  treated  similarly.  The  a ~ in  (5.2)  causes 

n n 

no  problem,  the  difficulty  in  proving  tightness  of  (Un)  on  R 

has  been  due  to  the  e.  term,  although  it  is  "sure"  to  be 

i , n 

eliminated  eventually.  Indeed,  if  tightness  of  (Un)  can  ke 
shown,  then  even  (A5.4)  can  be  replaced  by  the  weaker  convergence 
in  the  theorem  of  Kushner  [10]. 


a 


I .• 


Define  qN(t)  = 0 on  [ 0 , At^) , qN(t)  = i on 


A^N+*  * ’ +^^N+i-l ' A*"N+’  ’ * +A^N+i^  ' Then 


N 

Ww(t)  = 


qN(t)-i 


?N+i (AtN+iJ 


Tightness  and  convergence  of  {Wn(*)}.  Let  us  drop  (A5.6)  to 


(A5.8).  We  will  replace  them  by  the  "reasonable"  conditions 


(All . 1 ) to  (All. 3).  According  to  Billingsley  [16],  Theorem  15.5, 


if  {W  (•)}  satisfies  a criterion  "similar"  to  that  used  to  prove 
tightness+on  Cr[0,°°),  then  it  will  be  tight  on  Dr[0,°°),  and  all 


limits  will  be  continuous  w.p.l.  In  particular,  by  [16], 


Theorems  15.5,  12.3  and  12.2,  and  the  definition  of  WN(*)»  this 


holds  if  there  is  a real  K such  that 


. n+m-1  _ 

4 . i r . . i 2 


(11.1)  E|Wn+m  - Wj4  < K | l At.r,  all  n,m  > n. 


Equation  ill. 2)  is  equivalent  to  (11.1)  , where  tR  = tN+n  - tN< 


n+m-1 


(11.2)  ElwVJJ+J  - wVJJ)!4  < K | l At"|2,  all  N,n,  m > n 


Just  to  simplify  the  notation  in  the  following  development,  we 


assume  that  the  £ are  scalar  valued. 

n 


Let  _#n  be  the  o-algebra  determined  by  XQ,...,X 


. £ . Assume 


+C[0,°°)  = space  of  continuous  functions  on  [0,°°)  with  the  metric 
of  uniform  convergence  on  finite  intervals. 


d 
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(All . 1 ) There  is  an  integer  k > 0 such  that  for  all  N,i, 


CN+ii  < (1+UNI  Jou,  where  ou  are  real  quantities 


qN(t) 


Ly  N 

satisfying  I 
i=0 


0 as  N -*■  for 


each  t,  where  we  define 

N £ = 0 * 


Let  R(0;J£)  and  , be  real  quantities  such  that 

i,fl  - 

q.T  Ct) 

N 1 /? 


I | R C©  ? ^ ) I < °°»  l jlAtN+iAtN+j' 

SL  ifD  = l 


as  n <■  °°,  for  each  t. 


(All. 2) 


^ E^N^N+i^N+i+£  " R(0,£)l  i (1+UNI  )6i,i+£'  1 - °f 

for  some  integer  k > 0,  and  all  N,i,£. 


(All. 3)  There  is  a bounded  function  R(*, and  real  number  K 
such  that  |E5iCjCkC£|  5 R(ifjfk,£)  and  (where  t 
t + s are  restricted  to. jump  times  of  WN  ( • ) ) 


qN(t+s)-l  i /2  2 

l R(N+i,N+j ,N+k,N+£) (AtN+iAtN+jAtN+kAtN+^)  < s 

i r j / ^ r ^ ^ ^ ^ 


i* 


Discussion  of  the  conditions.  The  conditions  can  all  be  weakened, 
but  we  do  not  know  their  "best"  form.  (All.l)  replaces  (A5.6). 

It  describes  the  type  of  mixing  or  summability  condition  that 
holds  if  the  {£m>  were  generated  by  the  solution 

to  an  equation  such  as 


'*•  • r ' 


(11. 3) 


^n+i+l  + a0^n+i  + alt;,n+i-l  **’  + aisn  vn' 


where  the  (iJj  } are  independent,  Gaussian,  zero  mean,  and 
identically  distributed,  and  the  roots  of  [ X1+^  + a^X1  + •••  + = 0) 

are  all  strictly  interior  to  the  unit  circle. 

Similarly,  (All. 2)  holds  for  (11.3).  If  the  noise  {£m} 
were  a stationary  process  - with  = R(£)/  then  (All. 2) 

would  read 


lE0N?N+i^N+i+Jl  “ E^N+i^N+i+£  ^ - (1+^nI  )gi,i+2' 


The  condition  (All. 2)  is  a type  of  "asymptotic"  stationarity  (as 
X -*■  0)  combined  with  a mixing  condition.  Conditions  (All.l)  - 
(All. 2)  are  used  to  show  that  the  limit  of  (WN(*))  is  a Wiener 
process,  given  tightness,  and  (All. 3),  which  implies  (11.2),  also 
implies  tightness.  Also,  it  holds  for  (11.3). 

(All. 3)  actually  is  not  too  restrictive.  For  example,  it 
commonly  occurs  that  there  are  functions  R(*,*)  such  that 
R (i , j , k,  £ ) < R(i,j)RCk,£)  + R(i,*)R(k,j)  + R ( i , k)  R ( j , l ) , and  where 


R ( i , j ) < Ma  1 


for  some  a e (0,1),  and  real  M.  Then  the  sum 


in  (All. 3)  is  bounded  above  by 


(11.4) 


qN (t+s) -i 

I 

i» j=qN  (t) 


(AtN+iAtN+j } 


and,  in  turn,  the  bound  in  (All. 3)  can  be  verified  from  (11.4). 
Note  that  (All. 3)  implies  that: 


* ■-  »*• 


(11.5)  For  each  t,  there  is  an  M^Ct)  < 00  such  that 
E|WN(t)|4  < M1(t),  all  N. 

Under  (All. 3),  (11.2)  holds  and  (WN(*)}  is  tight  on 

D [0 ,°°)  , and  the  paths  of  any  limit  process  are  continuous  w.p.l. 
Assume  that  {U^}  is  tight  on  Rq.  Then,  (UN(*)}  is  tight  also. 

We  will  prove  that  the  limit  of  {W  (•)}  is  a Wiener  process 

OO 

with  covariance  Rn(0)  = R(9;0)  +2  ) R(0;i).  First,  some 

0 i=1 

estimates  are  needed.  Let  ^^(t)  be  the  smallest  a-algebra  which 
measures  {WN  (s) ,UN (s) , s < t},  and,  for  notational  simplicity,  fix 

s,t,  and  set  = N + qN(t). 

By  (All . 1 ) , and  the  definition  of  WN(t+s)  - WN(t), 


(11.6) 


wN(t+s)-wN(t)  | < l (AtN+i)'L/^ai  (i+K 


qN(t+s)_1 


1/2  N 


i=qN(t) 


V 


We  can  write 


qN{t+s)_1 


(U.7)  (WN(t+s)-WN(t))2  = l (AtN+iAtN+j)  E«  (tjWwj 

1 i/i=qN(t)  N 


qN(t+S)_1 


AtN+iE^N(t)^N+i 


qN(t+s) 


-fc-1 


1/2 

+ 2 l l (AtN+iAtN+i  + «,)  (t)^N+i^N+i+ii* 

«,>!  i=qN(t)  N 


By  (All. 2),  we  can  rewrite  (11.7)  as 


•**  r ’ ■*. 
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gN(t:+s)-l  <^(t+s)-£-i 

R(0:0’  AW  + 2.J  8(0;i i)  . I . 


(11.8) 


+ FN(t,s)  , 


i=qN(t) 


'N+i  N+i+£ 


where  the  coefficients  of  the  R(0;O)  or  2R(9;£)  tend  to  s, 
as  N ->■  »,  and  where 


qN  (t+s)-l 


(11.9)  | F 


N(t,s)|  <2(1+|^|2).  . I e*fjCAtN+iAt  J1/2. 

^ ifD=q„(t) 


Let  hCO  be  a real  valued  bounded  continuous  function  on 
some  Euclidean  space  R2q,  with  tlf...,t  arbitrary  real  numbers 
< t.  Let  { UN  C - ) ,WN(-)}  denote  a weakly  convergent  subsequence. 
Let  the  limit  process  be  denoted  by  U(*),W(*).  By  (11.6)  and 
the  uniform  integrability  (11.5)  and  (All.l), 


(11.10)  Eh(WN(ti),  UN(ti),  i < q)  (WN(t+s)  - W14  (t)  ) ->  0, 


By  weak  convergence,  and  the  uniform  integrability  (11.4),  the 
left  side  of  (11.10)  also  tends  to 


■*  i ' . 


«v  *»'*•*  ■?" 
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Eh(W(ti)  ,U(ti)  , i < q)  (W(t+S)  - W (t)  ) , 


which  must  thus  equal  0.  Similarly,  (11.8),  (11.9),  (All. 2) 

and  the  weak  convergence  and  uniform  integrability  (11.4) 
yield  that 


Eh(W(t.)  ,U(ti)  , i < q)[(W(t+s)  - W(t)P  - R (0)s]  = 0. 


The  last  paragraph,  together  with  the  arbitrariness  of 
h(*)/  t,s,  ti  < t,  and  the  continuity  of  W(*)/  w.p,l,  imply  that 
W (• ) is  a continuous  martingale  with  quadratic  variation  RQ(0)s, 
hence  it  is  the  asserted  Wiener  process. 


t ' t ' ' -t  . • 


42. 


REFERENCES 


[1]  M,  T,  Wasan,  Stochastic  Approximation,  Cambridge  University 
Press,  Cambridge,  1969. 

[2]  L.  Ljung,  "Convergence  of  Recursive  Stochastic  Algorithms", 
Report  7403,  1974,  Lund  Institute  of  Technology,  Division 
of  Automatic  Control,  Lund,  Sweden. 

[3]  L.  Ljung,  T.  Soderstrom,  I.  Gustavsson,  "Counterexamples  to 
General  Convergence  of  a Commonly  Used  Recursive  Identifica- 
tion Method",  IEEE  Trans,  on  Automatic  Control,  AC-20,  1975, 
pp.  643-652. 

[4]  V.  Fabian,  "Stochastic  Approximation  of  Constrained  Minima", 
Proc.  4th  Prague  Conference  on  Statistical  Decision  Theory 
and  Information  Theory",  1966,  pp.  277-289. 

[5]  H.  J.  Kushner,  H.  T.  Gavin,  "Stochastic  Approximation-like 

Algorithms  for  Constrained  Systems:  Algorithms  and  Numerical 

Results",  IEEE  Trans,  on  Aut.  Control,  AC-19,  1971, 

pp.  349-357. 

[6]  H.  J.  Kushner,  "Stochastic  Approximation  Algorithms  for 
Constrained  Optimization  Problems",  Ann.  Statist.,  2J1974), 
pp.  713-723. 

[7]  H.  J.  Kushner,  E.  Sanvicente,  "Stochastic  Approximation  for 
Constrained  Systems  with  Observation  Noise  on  the  System 
and  Constraints",  Automatica,  1^1(1975),  pp.  375-380. 

[8]  H.  J.  Kushner,  E.  Sanvicente,  "Penalty  Function  Methods 
for  Constrained  Stochastic  Approximation",  J.  Math.  Anal, 
and  Applic.,  £6  (.1974)  , pp.  499-512. 

[9]  H.  J.  Kushner,  M.  L.  Kelmanson,  "Stochastic  Approximation 
Algorithms  of  the  Multiplier  Type  for  the  Sequential  Monte 
Carlo  Optimization  of  Constrained  Systems",  SIAM  J.  on 
Control,  1£,  August  1976. 

[10]  H.  J.  Kushner,  "General  Convergence  Results  for  Stochastic 
Approximations  via  Weak  Convergence  Theory",  to  appear 

J.  Math.  Anal,  and  Applic. 

[11]  A.  Miele,  E.  G.  Cragg,  R.  R.  Iyer,  A.  V.  Levy,  "Use  of 
Augmented  Penalty  Function  in  Mathematical  Programming 
Problem^',  Part  I;  J.  Optimiz.  Theory  and  Applications, 

8 (1971) , pp.  115-130. 


**  f 


43. 


[12]  V.  Fabian,  "On  Asymptotic  Normality  in  Stochastic  Approxima- 
tion", Ann.  Math.  Statist.,  3_9  (1968)  , pp.  1327-1332. 

[13]  J.  Sacks,  "Asymptotic  Distribution  of  Stochastic  Approximation 
Procedures",  Ann.  Math.  Statist.,  29^(1958),  pp.  273-405. 

[14]  D.  J.  Scott,  "Central  Limit  Theorems  for  Martingales  and  for 
Processes  with  Stationary  Increments  Using  a Skorokhod 
Representation  Approach",  Adv.  in  Appl.  Prob.  5^(1973), 

pp.  119-137. 

[15]  D.  I.  Bertsekas,  "Multiplier  Methods:  A Survey", 

Automatica,  12(1976),  pp.  133-145. 


[16]  P.  Billingsley,  Convergence  of  Probability  Measures,  Wiley, 
New  York,  1968. 

[17]  W.  Whitt,  "A  Guide  to  the  Applications  of  Limit  Theorems  for 
Sequences  of  Stochastic  Processes",  Oper.  Res.,  18(1970), 
pp.  1207-1213. 

[18]  H.  J.  Kushner,  Probability  Methods  for  Approximations  in 
Stochastic  Control  and  for  Elliptic  Equations,  Academic 
Press,  New  York,  to  appear,  late  1976  or  early  1977. 

[19]  D.  E.  Iglehart,  "Diffusion  Approximations  in  Applied 
Probability",  Math,  of  the  Decision  Sciences,  Part  II, 
Lectures  in  Appl.  Math.,  .12  (1968),  Amer.  Math.  Soc., 
Providence,  R.I.. 

[20]  T.  Lindvall,  "Weak  Convergence  of  Probability  Measures  and 
Random  Functions  in  the  Function  Space  D[0,°°);  J.  Appl. 
Prob.,  10(1973),  pp.  109-121. 

[21]  B.  T.  Polyak,  "Iterative  Methods  Using  Lagrange  Multipliers 
for  Solving  Extremal  Problems  with  Constraints  of  the 
Equation  Type",  Zh.  Vychisl.  Mat.  mat.  Fiz.,  10  (1970) , 

pp.  1098-1106  (Transl.  pp.  42-52). 

[22]  H.  J.  Kushner,  S.  Lakshmivarahan , "Numerical  studies  for 
constrained  stochastic  approximation",  to  be  submitted  to 
IEEE  Trans,  on  Automatic  Control. 


III.  NUMERICAL  STUDIES  OF  STOCHASTIC  APPROXIMATION 


PROCEDURES  FOR  CONSTRAINED  PROBLEMS 


by 


Harold  J. 


Kushner+, 


S. 


Lakshmivarahan 


++ 


August  1976 


+Brown  University,  Providence,  Rhode  Island,  Lefschetz  Center  for 
Dynamical  Systems.  Supported  by  the  Air  Force  Office  of 
Scientific  Research  AF-AFOSR  76-3063,  National  Science  Foundation 
Eng  73-03846-A01 , and  Office  of  Naval  Research  NONR  N000  14-76-C- 
0279. 

++Brown  University,  Providence,  Rhode  Island.  Lefschetz  Center  for 
Dynamical  Systems.  On  leave  from  I.I.T.,  Madras,  India. 

Supported  by  the  Air  Force  Office  of  Scientific  Research 
AF-AFOSR  76-3063  and  the  Office  of  Naval  Research  NONR  N000  14-76- 
C-0279 . 


NUMERICAL  STUDIES  OF  STOCHASTIC  APPROXIMATION 
PROCEDURES  FOR  CONSTRAINED  PROBLEMS 


ABSTRACT 

Four  algorithms  for  stochastic  approximation  under  equality 
and  inequality  constraints  are  discussed  and  described,  together 
with  numerical  data  and  compariosns  from  numerous  simulations. 

The  algorithms  work  well,  and  exhibit  some  rather  interesting 
behavior.  Those  based  on  "augmented  Lagrangian"  techniques  are 
preferable  to  the  "Lagrangian"  methods,  as  in  the  deterministic 
case;  the  former  methods  seem  to  be  quite  robust  and  reliable. 

The  study  is  the  first  (to  the  authors*  knowledge)  numerical  study 
of  such  algorithms. 


NUMERICAL  STUDIES  OF  STOCHASTIC  APPROXIMATION 
PROCEDURES  FOR  CONSTRAINED  PROBLEMS 

Harold  J.  Kushner  and  S.  Lakshmivarahan 

1.  Introduction 


The  purpose  of  this  paper  is  to  discuss  several  recently- 
developed  algorithms  for  constrained  stochastic  approximation 
(.SA) , and  to  give  some  typical  numerical  results,  taken  from  a 
large  number  of  simulations.  Most  work  on  SA  has  dealt  with  the 
unconstrained  case  (see  references  in  Wasan  [1] , and  Ljung  [2] , 
for  example).  Nevertheless,  there  are  numerous  applications 
for  the  constrained  problem  - which  has  received  relatively 
little  attention;  see  Fabian  [3],  Kushner  [4],  Kushner  and 
Gavin  [5],  Kushner  and  Sanvicente  [6],  [7],  Kushner  and 
Kelmanson  [8],  Kushner  [9],  [10].  Some  of  the  many  possible 
applications  are  described  in  [6]. 

References  [3] , [7]  dealt  with  penalty  type  methods, 
references  [4],  [5]  with  stochastic  analogs  of  the  methods  of 
feasible  directions,  [6]  with  a Lagrangian  method,  [8]  with 
several  methods  of  the  "augmented"  Lagrangian  type  and  [9], 

[10]  dealt  with  more  general  convergence  proofs,  and  results 
on  rates  of  convergence  and  exploited  ideas  in  the  theory  of 
weak  convergence  of  a sequence  of  probability  measures  - to 
obtain  general  results  rather  efficiently.  While  our  under- 
standing of  the  theoretical  properties  of  the  SA  algorithms  for 
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the  constrained  problem  is  progressing,  virtually  nothing  is 
known  about  the  numerical  properties.  Obviously,  an  understand- 
ing of  the  numerical  problem  is  important  for  potential 
applications.  But,  it  also  points  toward  needed  further 

theoretical  development.  As  far  as  the  authors  are  aware,  this 
is  the  first  paper  (besides  [5])  to  discuss  numerical  results  for 
constrained  SA. 

Reference  [5]  discussed  some  numerical  results  for  the 
stochastic  methods  of  feasible  directions.  While  the  results 
there  were  interesting,  and  exhibited  some  of  the  salient 
features  of  the  stochastic  problem,  some  problems  were  en- 
countered with  the  behavior  of  the  iterates  on  the  boundary 
(the  procedure  seemed  too  inefficient  there) . These  problems 
suggested  the  importance  of  investigating  "dual"  type  methods. 

This  paper  is  devoted  to  a description  (with  numerical  data) 
of  several  such  methods:  there  is  little  doubt  that  (if  non- 

feasible  iterates  are  allowable)  their  performance  is 
considerably  superior  to  the  performance  of  the  methods  in  [ 5 ] . 

In  many  respects,  the  behavior  of  the  constrained 
algorithms  is  better  than  the  behavior  of  the  standard 
(unconstrained)  SA.  The  iterates  are  bounded,  and  the  con- 
straints often  force  the  iterates  to  remain  near  a boundary, 
thus  further  restricting  the  state  space.  It  would  appear, 
from  our  simulations,  that  there  should  be  no  more  hesitation 
in  using  the  "constrained"  algorithms,  than  there  would  be  for 
"unconstrained"  SA.  Also,  there  are,  undoubtably,  applications 
to  areas  such  as  identification,  which  have  not  yet  been  explored. 
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The  convergence  proofs  for  all  the  algorithms  given  here  - 
appear  elsewhere  [6] , [8] . The  main  concern  of  this  paper  is  with 

the  numerical  properties,  and  with  a description  of  the  observed 
path  behavior  and  its  dependence  on  the  parameters  of  the  algorithm. 
Section  2 treats  an  equality  constrained  problem  (algorithm  1 of 
[8]).  Section  3 treats  a Lagrangian  method  (from  [6]),  and 
Sections  4 and  5 treat  the  "augmented"  Lagrangian  algorithms  3 and 
4,  resp.  of  [8],  The  algorithms  worked  surprisingly  well. 

All  the  reported  simulations  are  on  two  dimensional  problems, 
except  for  some  results  in  Section  4,  which  concern  a 5 
dimensional  problem. 

2 . An  Equality  Constrained  Algorithm 

The  problem  and  the  algorithm.  The  problem  is  to 
sequentially  minimize  f(-),  under  constraints  (x)  = 0,  i = l,...,s. 
The  function  f(-)  will  always  be  real  valued,  be  defined  on  the 
Euclidean  parameter  space  Rr,  and  have  bounded  second  derivatives. 
The  functions  <)>^(’)  ore  real  valued,  with  continuous  first 
derivatives,  and  are  assumed  known.  As  is  usual  in  SA,  the  form 
of  f(*)  is  not  assumed  known.  However,  at  any  selected  para- 
meter setting  (say  x) , a noise  corrupted  estimate  of  the 
performance,  namely  f (x)  + noise,  can  be  observed.  This  pro- 
vides our  only  information  on  f(»).  In  the  simulations, the  noise 
sequence  will  be  independent,  Gaussian  and  have  mean  zero. 

Let  {an,cnJ  denote  real  positive  null  sequences  (the 
t h 

second  being  the  n finite  difference  interval) , k a positive 


' T 
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2 fch 

scalar,  P(x)  = £ 1 4> . (x ) | , e.  the  unit  vector  in  the  i 

i 

coordinate  direction,  and  {Xn>  the  sequence  of  estimates  of  the 
constrained  minimum,  which  we  always  denote  by  0.  Define  the 
matrix  $(x)  = [ 4>  _ (x),...,<J>  (x)  ] ' , where  ' is  transpose  and 

1 / X S / X 

the  subscript  x on  4>  . (•)  denotes  gradient.  Let  the 

l , x 

operator  it  (x)  be  defined  by:  (I-ir(x))y  is  the  projection 

of  a vector  y onto  spanfij).  (x),...,0  (x)  ] ; i.e.,  iT(x)y 

1 , X S t X 

is  the  projection  of  y onto  the  tangent  line  or  plane  to  the 
curve  or  surface  {z:  <f>(z)  = <f>(x)},  at  z = x.  The  vectors 
6f(x,c)  and  <$f(x,c)  defined  by 


6f(x,c)  = [6f 1 (x,c) , . . . , 6fr (x,c) ] ' = 


J 


= 2^—  [f  (x+e^c)  - f (x-e^^c)  , 

Ln 


6f(x,c)  = — [f (x+e^c)  - f(x), 
n 


are  the  central  difference  and  one-sided  difference  estimates 

of  the  gradient  f^fx).  Let  the  "noisy"  gradient  estimate 

(central  difference)  6Y  = (6Y1 , . . . , 6Yr ) ' be  defined  by 

n n n 


[f (X  +e.c  ) - f (X  -e.c  )]  [£  ■“?_•] 

.„i  n l n n l n , n,x  n,i 

6Yn 2c + 

n n 


- {£l<Vcn>  + ^/2cn  ■ 


where  £ + . ,£~  . are  the  observation  noises  in  the  measurements 
n , x n, x 

of  performance  at  X ± e.c  , resp. . Define  SY  similarly, 

n x n n 

by  6Y^  = 6fi(X  ,c  ) + £;Vc  , where  now  £*  = £ + . - £;°  . , E,°  . 
n n n ^n  n ^n  n,x  sn,x  sn,x 


being  the  observation  noise  in  the  performance  measurement  at  x . 

n 

Algorithm  1 of  [8|  is  a stochastic  version  of  an 
augmented  Lagragian  algorithm  of  Miele,  et.  al  [11] , and  takes 
the  form  (using  central  dif f erences) + 


(2.1) 


xnj  i = X - a [tt  (X  ) <5Y  + £ P (X  )] 

n+l  n n n n 2 x n 


- Xn  - an""Xn,SflVcn)  + * < W2c„  + 7 Px  (Xn' 1 


In  all  the  simulations  of  the  paper,  if  the  step  size 

! Xn+^  - Xn|  obtained  from  (2.1)  was  > i-,  we  truncated  it  and 

defined  Xn+^  by  (|*|  is  the  Euclidean  norm) 

y _ y step  from  (2.1) 

n+l  n 2*  [step  f i -5m  (2.1)  | 


Such  a truncation  is  clearly  necessary  in  any  SA  procedure,  to 
assure  that  the  sequence  will  not  be  oversensitive  to  large  errors 
(noise,  bias)  or  overrelianue  (resulting  in  steps  of  excessive 
size)  on  values  of  f^(x)  or  6f(x,c)  in  the  early  stages  of  the 
procedure. 

Under  some  additional  conditions  on  a ,c  , £ (which  will 

n n n 

always  be  satisfied  here)  (Xn)  converges  w.p.l  to  a point  0, 
which  satisfies  the  necessary  condition  of  the  calculus  for  a 
constrained  minimum  (original  proof  in  [8] , weakened  conditions 
in  [9] ) . 

+The  algorithm  in  [8]  uses  fy(Xn)  + noise  for  the  observation,  but 
(as  for  the  other  algorithms  of  the  sequel) , the  finite  difference 
form  and  proof  require  only  small  changes. 
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In  the  simulations  to  be  reported  on  we  used  f(-)  of 
the  quadratic  form 

C2.2)  f(x)  = (x^+2 . 5 ) 2 + (x2  - j)2/4 

and  (only  on^Cftnstraint  in  2-space  variables) 

(2.3)  c)>(x)  = -sin  x1  + x2- 

Actually,  other  constraints  were  studied,  but  (2.3)  allows  a 

fairly  stringent  test,  and  exhibition  of  the  features  of  the 

2 

algorithm  - at  least  in  R . 

The  finite  difference  estimate  <5f(x,c)  of  f^(x)  (for 

(2.3) )  is  unbiased.  In  order  to  add  bias,  without  getting 

involved  in  a choice  of  a more  complicated  function,  we  used 
6f  (x,c) , in  (2.1),  in  lieu  of  6f(x,c),6Yn  - in  all  the 
simulations.  Note,  however,  that  the  biases  for  a general, 
smooth,  f(*)  satisfy 

|fx  (x)  - 6f 1 (x, c) | % |f  (x)c2|/3! 

i iii 

(2.4) 

If  (x)  - 6 f’i  (x,  c)  | % |f  (x)  c | /2 ! , 
l xixi 

and  so  our  "trick"  has  created  a somewhat  larger  bias  (for 
small  cn)  than  we  would  get  for  a general  smooth  f(*).  Thus, 
with  use  of  a central  difference  estimate,  the  algorithm  would 
actually  perform  better  than  indicated  by  the  simulations. 


! 
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The  form  of  {a  ,c  }.  Let  n.  denote  an  integer  (=  100 

in  all  the  simulations),  let  a,y,6,A  and  C be  positive  real 

numbers  with  6 > a.  Define  the  sequence  {mn}  as  follows: 

mn  = m.  =1;  for  n>2,m  =m  + 2 if  the  angle  formed  by  the 
u i — n n-i 

3 points  (X  _,X  .,X  ) is  less  than  90°.  Otherwise  m = m , . 

n-2  n-1  n n n-1 

5 Cl  Y 

Set  an  = A/m^,  n < nQ;  aR  = A/mn,  n > nQ,  cn  = C/(n+l) ' . 

The  method  of  calculating  an,cn,mn  provides  a kind  of 

adaptation.  If  the  angles  are  generally  > 90°,  we  "presume" 

that  the  sequence  is  moving  more  or  less  monotonically  toward 

its  goal.  If  the  angle  is  < 90°,  we  presume  that  it  was 

caused  by  either  a noise  effect  or  an  overshoot,  and  increase 

m , hence  decrease  a . The  increase  in  m is  2,  since,  in 
n n n , 

the  pure  noise  situation,  the  angle  will  be  less  than  90° 

one-half  of  the  time,  and  we  want  the  average  long  term  value 

of  m to  be  n (which,  indeed,  it  was  - modulo  reasonable 
n 

statistical  fluctuations) . 

The  break  in  the  rate  of  decrease  of  a , at  n = 100  , 

n 

is  for  the  following  reason.  Clearly,  some  form  of  adaptation 
is  ncessary.  The  problem  of  choice  of  an'cn  an  the  constrained 
problem  is  much  more  difficult  than  it  is  in  the  unconstrained 

problem.  However,  experience  with  SA  simulations  indicates 

that  it  can  be  harmful  if  a decreases  too  rapidlv 

n 

in  the  initial  phases.  The  value  of  a^  may  become  too  small, 
while  the  iterates  are  still  quite  far  from  the  region  of  the 
constrained  minimum:  a should  not  be  allowed  to  decrease  too 
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fast  until  it  is  clear  that  either  overshoots  are  becoming  a 
problem  (which  is  not  usually  too  serious  when  there  is  a step  size 
limitation)  or  that  noise  effects  begin  to  play  a serious  and 
steady  role.  The  choice  taken  above,  involving  the  n^,  and 
8 < a,  was  a simple  (and  effective  within  the  context  of  the 
simulations)  attempt  to  deal  with  this.  The  topic  will  be  re- 
turned to  in  connection  with  the  discussion  below  concerning 
the  rate  of  convergence. 

Theoretical  rate  of  convergence . + Let  ir(x)f  (x)  = 

(v1 (x) , . . . , vr  (x) ) ' , and  w(x)  = [v1  x (x)  , . . . , v x (x)  ] ' . The  matrix 

w(9)  is  (in  the  proper  coordinate  system)  the  Hessian  matrix  (at  6) 

of  the  function  f(-),  defined  on  the  constrained  surface. 

Define  K = w (0  ) +k$'(0)$(0),  let  y < a < 1 , 3 = min  [2y  ,a/3]  , 

3 = min [y, a/4]  and  B = (B^,...,B  ),  B = (B^ , . . . , B^ ) , where 

B.  = (f  (0)/3!  ) and  B.  = f (0)/2.  Assume,  for 

1 xx. x . l x.x. 

ill  ii 

simplicity  (and  it  is  also  true  in  our  simulations)  , that 

2 2 

CovtC  I il  -*■  o I,  for  some  constant  a ; as 

n1  01  n-1 

—1  —2 

n -*•  °°.  Define  U ,U  to  be  random  vectors  whose  distributions  are 
the  stationary  distributions  of  the  solutions  to  (2.5a,b)  resp. , 
(which  will  exist,  under  the  conditions  below).  Let  (in  this 

Ct  Y 

subsection  only)  an  = A/(n+l)  , cn  = C/(n+l)  . 

+This  subsection  quotes  a result  from  [10] , which  is  not  yet 
published.  The  result  helps  to  explain  some  of  the  numerical 
data  and  the  problem  of  parameter  selection,  but  the  result  is 
not  critical  here. 
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(2.5a)  dU1  (t ) = -(AK-3I)U1(t)dt  - AC2Bdt  - (A/2C)odw, 

(2.5b)  dU2(t)  = -AK  U2(t)dt  - AC2Bdt  - (A/2C)adw, 

where  w(*)  is  a standard  Wiener  process. 

The  following  result  is  in  [10],  and,  after  some  additional 
work,  can  also  be  obtained  in  part  from  [12] . We  do  not  list  all 
the  conditions  for  validity,  but  they  all  hold  for  our  case.  Here, 

A is  either  a scalar  or  a matrix. 

I.  Use  the  central  difference  in  (2.1). 

(la)  Let  a = 1,  and  B = 2y  = a/3,  and  assume  that  all 
the  eigenvalues  of  - (AK-8I ) = -K^  have  negative  real  parts. 

Then  n^(X  -0)-^->-  U^"  (in  distribution). 

(lb)  Let  a < 1,  3 = 2y  = a/3,  and  let  the  eigenvalues  of 
-AK  = -K2  have  negative  real  parts.  Then  np  (Xn~8 ) ■>  U . 

(lc)  If  2y  < a/3,  the  noise  is  less  important  than  the 
bias,  and  the  above  (Ia  or  b)  holds  with  a = 0 in  (2.5).  If 

2y  > a/3,  the  bias  is  less  important  than  the  noise,  and  (Ia,b)  holds 
with  B = 0 in  (2.5). 

(ld)  If  (Ia)  holds  except  for  the  eigenvalue  criterion* 
but  the  eigenvalues  of  -(AK-bl)  have  negative  real  parts  for 
some  be  (0,3)*  then  n D(X  -0)-^->O.  There  is  an  analogous 
version  of  (Ic ) . 

II.  Use  the  one  sided  difference  <SY  in  (2.1).  Then 
— n — 

(Ia-d)  continue  to  hold,  with  A/C,3*B  and  y > a/4  (or  =) 
replacing  A/2C,3*B  and  2y  > a/3  (or  =) , resp. 


3.  ' T ’ t 
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On  the  choice  of  A,C ,a ,y > 6 , returned.  In  Case  I, 

a = 1,  y = yields  the  best  asymptotic  rate,  and  a = 1, 

y = i does  so  in  Case  II.  The  rate  result  can  also  be  used 

as  a basis  for  choosing  A and  C.  But,  note  that  selecting 

the  parameters  via  a minimization  of  the  normed  asymptotic 

variance  must  be  done  with  great  care.  It  is  not  clear  what 

"asymptotic"  means,  when  the  number  of  iterations  is  only  in 

the  hundreds,  or  is  even  less.  Often  using  a = 1 led  to 

relatively  poor  results  in  the  first  few  hundred  iterations. 

The  an  decreased  too  fast,  and  often  left  the  iterate  sequence 

"stuck"  away  from  the  vicinity  of  0.  So,  we  prefer  to  use 

a < 1.  The  value  a = f-  seemed  to  yield  reasonably  good 

b 

results.  In  the  runs,  with  a = 1,  when  there  was  a reasonably 
rapid  (within  the  first  100  iterations)  "initial"  convergence 
to  a small  enighborhood  of  0,  then  a = 1 performed  (as 
expected)  slightly  better  than  similar  runs  with  a = |-;  there 
was  a little  more  "noise  stability".  When  a was  reduced  below 
5 

, the  behavior  worsened  (the  "asymptotic"  noise  effects 
b 

increased  - without  a compensating  improvement  in  the  initial 

behavior) . We  also  settled  on  y = i,  which  is  between  the 

requirements  of  I and  II.  Note  that  the  simulations  used 

a = A/m^  or  A/ma . 
n n n 

It  is  difficult  (and  perhaps  dangerous)  to  generalize  from 
a few  simulations.  However,  on  the  average  (of  20  run  sequences 
of  500  iterations  each)  for  the  case  of  Figure  1.1,  where  k = 1 
and  A = i-,  the  value  a = |-  did  several  percent  better  than 
did  the  value  a = 1;  a = ^ always 
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did  better  when  the  convergence  in  the  first  100  iterates  was 

relatively  poor.  On  the  other  hand,  a series  of  20  simulations 

with  a = 1,  A = did  (a  few  percent)  better  than  did  a series 

with  a = g-  , A = (the  averages  f (X^^q)  were  within  .04 

and  .06  of  the  minimum  value,  and  the  x500  points  were  fairly 

5 

close  to  the  curve  in  both  cases)  . It  was  found  that  a = was 

b 

a reasonable  overall  choice. 

As  is  clear  from  the  rate  formula,  the  optimal  (asymptotic) 

value  of  A depends  on  the  Hessian  of  f(')  on  the  constrained 

surface  and  on  4>  ( ■ ) at  0.  We  took  a relatively  simple  method 

for  selecting  A - by  assuming  that  the  problem  was  unconstrained, 

and  making  a rough  estimate  (from  data)  of  the  appropriate  A for 

the  unconstrained  problem,  and  then  using  a slightly  smaller  value. 

In  fact,  the  optimal  (asymptotic)  value  of  A for  a one-dimensional 

unconstrained  iteration  in  the  direction  is  20/f"(0), 

provided  that  a = l,  y = 4,B=0.  If  a<l,  then  smaller  values 

b 

of  A are  preferable.  Larger  values  did  not  perform  as  well. 

2 

The  value  6 = was  used,  again  chosen  via  trial  and  error. 
Larger  values  have  the  same  liabilities  as  do  larger  values  of  a, 
and  with  smaller  values,  the  cummulative  noise  effect  was  generally 
too  great. 

Note  on  the  choice  of  k.  In  various  deterministic 
forms  of  the  agumented  Lagrangian  method,  the  algorithm  yields 
-*■  0 only  if  k > minimum  value  > 0 [13J.  Here  k > 0 is 

arbitrary  [8J.  But,  we  see  from  the  rate  result  that  the  rate 
depends  on  k,  and  may  be  low  if  k is  too  small.  Such  a 
phenomenon  was  observed  in  the  simulations,  and  provided  one  of  the 


■ r 


A 


12. 

motivations  for  the  derivation  of  the  rate  expressions.  Of 
course,  if  we  increase  A,  then  k is  effectively  increased,  as 
far  as  the  rate  is  concerned. 

Numerical  data.  See  Figures  2. 1-2. 3,  for  some  typical  runs, 

2 

each  of  500  iterations.  Note  that  a = 2,  a fairly  large  value. 

We  have  f(0)  = .255  and  9 = (-2. 7, -.43).  When 

we  observed  f (X  ) + noise  in  lieu  of  <5Y  = 67 (X  ,c  ) + 

x n n n n 

noise/2 c , the  runs  were  excellent,  converging  rapidly  and 

2 

directly  to  0,  even  with  a =2. 

The  figures  illustrate  the  effect  of  k.  Figures  2.1 

and  2.2  are  similar,  except  for  the  pronounced  (expected) 

compression  of  the  trajectory  about  the  curve  in  Figure  2.2. 

The  oscillations  (and  their  orientation  with  respect  to  the 

constraint  curve)  in  Figure  2.1  are  caused  by  the  fact  that  from 

iteration  50  on,  the  term  it  (X  ) f (X  ) % it  (X  ) 5f  (X  ) is  fairly 

n x n n n J 

small,  and  ir  (X  ) E /c  fairly  large.  Both  vectors  lie  in  the 
n n n 

tangent  line  to  the  curve  { y : <t>  (y)  = <)>  (x)  } at  x = X , and  so 

the  movement  is  largely  random  - and  oscillates  along  the 

"tangent  lines".  The  kP  (X  ) term  pulls  the  iterate  to  the 

x n 

curve.  (In  Figures  2.1  to  2.3,  we  see  something  like  a one- 
dimensional unconstrained  SA  - stretched  into  a helix  to  emphasize 
the  rather  random  "oscillating"  approach  to  0.)  If  k is  too 
small,  then  the  movement  to  the  curve  is  too  slow,  and  since 
an  > 0,  the  procedure  virturally  grinds  to  a halt  (as  happened 
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with  k = .1).  A similar  phenomenon  is  illustrated  by  Figure  2.3 


which  we  show  to  illustrate  the  phenomenon  only  - the  run  itself 
was  worse  than  the  average.  It  seems  preferable,  as  in  the 
deterministic  case,  to  select  the  largest  k for  which  the 
numerical  process  is  reasonably  well  conditioned.  Of  course,  k 
can  (should)  be  adjusted  in  the  course  of  the  iterations. 

Table  2.1  illustrates  the  effect  of  k.  The  numerical 
averages  of  <j>  (X50Q)  , f (X50Q)  and  numerical  variance  of  f(X5Q0) 
over  20  independent  runs  are  listed.  The  data  (besides  k)  is 
that  of  Figures  2.1  or  2.2. 


k = \ 

0 (X5oo) 

-.0168 

f (X500} 
. 3090 

var  f(X500) 
. 00194 

1 

-.0126 

. 3046 

. 00228 

4 

-.0013 

. 3003 

.00241 

Table  2.1.  The  effect  of  k 

The  larger  k are  preferable,  but  not  by  much.  As  k 
decreased  below  , the  performance  deteriorated  seriously. 

Let  us  return  to  the  rate  result.  It  can  be  shown  [10] 
that  one  eigenvalue  of  is  proportional  to  k;  of  course  - 

both  are  proportional  to  A.  Let  A be  a scalar  = and  use 
the  data  of  Figure  2.1.  If  k = 2,  the  eigenvalue  proportional 
to  k has  value  .909,  and  the  other  .305.  Thus,  k can  be 
reduced  to  2 ( . 305) / ( . 909)  before  the  convergence  rate  is 
dominated  by  the  eigenvalue  which  is  proportional  to  k.  This 
is  not  inconsistent  with  our  numerical  observations. 
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3 . A Lagrangian  Algorithm 


The  algorithm.  All  undefined  terms  are  as  in  the 
previous  sections.  Consider  the  Lagrangian  algorithm  of  [ 6 ] 
for  minimizing  f ( - ) subject  to  inequality  constraints 

q^x)  < 0,  i = l,...,s.  Unless  otherwise  mentioned,  the  qi(«) 
are  known  and  continuously  differentiable  (also  in  Sections  4,5). 
Data  on  f (.• ) is,  again,  obtained  only  via  noise  corrupted 

observations.  (The  algorithm  is  valid  if  the  q^(-)  are  unknown, 
provided  that  noise  perturbed  observations  of  the  constraint 
functions  are  taken;  this  will  be  commented  on  below.) 


Let  M?"  and 

M 

denote 

numbers  such 

that 

| e1 1 < m 

X 

and  the  multiplier 

X 

for 

q±(*) 

at  0 is  < 

Mx  • 

i i _ 

Let  {bn 

a null  sequence  of  positive  real  numbers,  and  q(x)  = 

(qx  (x)  , . . . ,qs  (x)  ) ' . Define  bY  Xn  = (X*,...,X®)'  and 


(3.1a) 


X = X - a f6Y„  + l xV  (X  )] 
n n n n n l ,x  n 

A^+1  = max[0,min{M^,  A^+b^Un)  }]  , 


(3.1b) 


= X1  if  lx1) 
n+1  n 1 n 1 


< M 

x 


= M if  X1  > NI 
x n x 


= -M  if  X < -M  . 
x n x 


Again,  the  maximum  size  of  lXn+l  “ Xnl'l^n+1  ~ | was 

limited  to  j . In  our  simulations,  we  used  6T(Xn,cn)  in  lieu 
of  6f(Xn,cn)  in  order  to  artificially  create  a sizeable  bias. 

Under  convexity  conditions  on  f(*)  and  q('),  and  under 


conditions  (which  hold  in  our  simulations)  on  {a  ,b  ,c  ,£  }, 

n'nn^n 

X -*■  d , the  optimal  point,  w.p.l,  as  n -+  °°  [6].  It  was  not 

proved  that  the  multipliers  An  converged  - although  they  did 

so  in  all  our  simulations,  when  the  q^(x)  were  convex  in  a 

neighborhood  of  0.  The  simulations  were  each  of  length  500; 

occasionally  if  it  did  not  appear  that  An  was  converging, 

we  took  a run  of  length  25,000.  Unless  the  noise  was  small 
2 1 

(a  < j)  or  fx(x)  + noise  observed,  the  convergence  of  {A  } 

wac  slow  - in  oscillatory  cycles  of  increasing  duration  and 

decreasing  variation.  When  some  q^(*)  was  not  locally  convex  at 

9,  the  situation  was  less  clear.  Sometimes  (A  } seemed  to 

n 

converge,  and  sometimes  the  variations  (max-min)  of  the  cycles  did 

not  obviously  appear  to  be  diminishing. 

In  the  convergence  proof  of  [ 6 ) an  = ^n  was  use<^'  ^ut 

the  proof  can  be  readily  modified  so  that  an  / b n,  provided  that 

both  sequences  decrease  at  the  same  rate.  Let  > 0 and 

> 0,  and  set  b^  = A^/m^,  an  = A^/m^,  n < 100,  and  bn  = A^/m^, 

a = A /ma , n > 100,  and  c = C/(n+l)Y. 
n x n n 

An  asymptotic  result. + Assume  that  A^  converges  to  some 
X.  If  q^(9)  = 0,  assume  X1  > 0.  For  the  asymptotic  result  we 
can  drop  all  constraints  for  which  qi(6)  < 0;  thus,  suppose  that 

+The  subsection  cites  a result  from  [10],  which  has  not  appeared. 
The  result  is  not  crucial  to  this  paper,  but  it  does  help  with 
the  understanding  of  the  numerical  data. 


VO  II  , 


q (0  ) = 0,  all  i.  Define  Q-  [c?i , x (0  ' ' * * ’ ,qs  ,x 

Q.  = Hessian  of  q.  (-)  at  9,  P = Hessian  of  f(-)  at  6 

1 • . . Dt 

Kq  = (F  + l X1Qi) , Ifc  = identity  matrix  in  R , 


(F  + I XlQ±) 


Q 


K, 


A I 
x r 


Vs 


K,  K. 


ei 


r+s 


Let  (in  this  subsection  only)  a =A  /(n+l)a,  b = A,/(n+l)a, 

J n x n X 

cn  = C/(n+l)Y.  Let  Cov[£n|  5o'*'*'^n-l^  o2Ir  as  n 
Let  U1,  i = 1,2,  denote  random  vectors  whose  distribution  is 
the  stationary  distribution  of  the  solution  to  (V1  is  s 
dimensional,  U1  is  r dimensional) 


' u1 

’ u1  ' 

A BC2 
X 

fA  o/2c) 

X 

= K 

- 

dt  - 

V1 

k 

v1 

0 

0 

where  K = or  K 2,  according  to  the  case.  Then,  if  we 

R p 

normalize  the  errors  as  (np(Xn~9),  np(An~A)),  the  asymptotic 

result  of  Section  2 holds  [ 10  ] ; also  for  the  one-sided 

_ 2 

difference  form  of  (3.1),  where  C,B  and  A^/C  replace  C ,B 

and  A^/20,  resp. 

The  asymptotic  rate  result  can  conceivably  be  used  to 
develop  some  sort  of  adaptive  method  for  adjustment  of  the 
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coefficients  A, ,A  . 

A X 

Numerical  data.  Refer  to  Figures  3.1  to  3.12.  Let 

q.^  (x)  = -sin  + x2 

q2(x)  = xx/4  - x2/2.5  - 1 
2 

q3(x)  = .1  (Xl-1)  - x2  - 3 ; 

(in  Figures  3.10-3.12,  q2(x)  = X]/71'  “ x2/2.5-l),  f(«)  is  as  in 
Section  2,  but  may  be  translated  and  rotated  as  indicated  in  the 
figures.  The  values  of  Ax,a,0,6,C  are  as  in  Section  2,  for  the 
same  reasons.  An  exception  is  that  6 = j in  Figures  3.2b,  3.4 
and  3.8  to  3.12,  since  we  plotted  runs  taken  before  the  value 

6 = j was  settled  upon.  When  the  ellipse  (f(*))  is  centered 
at  , the  constraints  are  locally  convex  at  0,  but  then 

there  is  a second  local  minimum  at  8^  = (-3.21,-0683)  with 
magnitude  10.2.  When  the  ellipse  is  centered  at  (- 2 or  at 
(4,0),  q^  C * ) is  not  locally  convex  at  0.  Also,  the  left-hand 
tail  of  the  constraint  set  is  "thin",  which  causes  some  problems 
when  we  do  not  assume  knowledge  of  q ( * ) , but  instead  take  noise 

corrupted  observations  of  the  form  q(Xn)  + observation  noise. 

Refer  to  Figures  3.1  to  3.4  (Figure  3.1  depicts  a rather 
poor  run),  where  the  ellipse  is  centered  at  . In  (the 

typical)  Figures  (3. 1,3. 2),  A^  = 2,  and  A^  = 5 in  Figure  3.3. 

+0  = (1.85, .961) , f (0)  = .396. 
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The  multiplier  X^  at  0 is  1.089.  We  set  Xg  = 0.  Since 
Xg  = 0,  the  procedure  starts  off  as  a pure  gradient  procedure, 
and,  initially,  heads  toward  the  unconstrained  minimum.  Thus,  soon, 
qi  (Xn)  > 0.  Then  X^  increases,  which  keeps  the  path  from 
wandering  too  far  from  the  constraint  set.  Progress  toward  9 is 
hindered  by  the  local  minimum  at  0^,  which  seems  to  divert 
the  path  until  iteration  8 or  so.  The  Lagrangian 

method  was  relatively  insensitive  (compared  with  the  algorithms  of 
Sections  4 and  5)  to  this  local  minimum.  (The  other  algorithms 
usually  converged  to  0^,when  Xg  was  to  the  left  of  0^.)  The 
asymptotic  rate  result,  and  the  fact  that  the  Hessian  of  q^(*) 
is  not  positive  definite  at  0^,  suggest  that  the  non-convex 
curvature  of  q^(-)  at  0^  makes  the  iterate  sequence  less 
stable  in  the  vicinity  of  0^. 

As  X^  increases,  the  path  moves  back  to  the  feasible  set, 
and  continues  to  leave  and  enter  the  set,  following  the  oscillations 
of  Xn,  which  in  turn  are  caused  by  the  variations  in  the  values  of 
the  qx(Xn)  sequence.  As  X*  increases,  the  effect  of 

X^q.  (X  ) on  the  (X  } increases,  helping  to  push  the 
n^l,x  n n 1 c 

iterates  to  the  constraint  set  (i.e.,  to  reduce  the  value  of 

q.  (x) ) . Once  X is  inside  the  constraint  set,  X decreases, 
In  n 

and  the  effect  of  the  gradient  f x (X  ) becomes  more  pronounced. 

All  this  is  as  expected. 

We  used  = 2 . The  value  of  thisupper  bound  is  quite 

important.  If  too  large,  the  oscillations  of  {Xn},  and  con- 
sequently those  of  {Xn},  i-n  initial  phases,  are  larger  - 

and  the  graphs  have  a substantially  wilder  appearance.  If 
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is  too  small,  then  the  limit  point  may  not  be  feasible.  Perhaps 
it  is  best  to  begin  with  a conservatively  small  value  - and  to 
increase  it  if  the  sequence  seems  to  be  converging  to  a 
non-feasible  point  - or  if  the  path  (Xn>  is  well  behaved. 

Observe  the  effect  of  (Figures  3.1,  3.3).  With  the 

larger  value,  the  oscillations  of  {An)  are  slightly  more  pro- 
nounced. With  A = 5,  the  results  were  preferable  to  those 

A 

for  A ^ = 2 . As  n increased,  the  period  and  the  magnitude  of 

the  { A } fluctuations  decreased  in  all  cases.  With  A = 1 
n X 

(not  plotted),  there  were  only  two  obvious  oscillations  in  the  first  500 
iterations;  they  were  fairly  smooth,  and  the  tail  of  the  second 
oscillation  appeared  to  be  settling  near  some  "steady  state" 
value.  Possibly,  more  insight  into  how  to  select  A ,A  can  be 

X A 

obtained  from  a study  of  the  asymptotic  result. 

A number  of  variations  on  the  method  of  adjusting  { An } 

were  tried  (in  the  absence  of  convergence  proofs) . In  some  runs, 

we  let  A ^ be  a diagonal  matrix,  whose  ifc^  element  (at  n) 

depended  on  the  sign  of  generally  being  larger  for 

q^  (Xn ) < 0 than  for  q^(Xn)  > 0.  The  results  were  not 

particularly  good,  but  the  method  was  motivated  by  a desire  to 

force  An  to  zero  as  rapidly  as  possible  when  the  iterates 

were  inside  the  feasible  region  and,  hence,  to  decrease  the  magnitude 

of  the  path  oscillations  when  in  that  region.  A method  that  worked 

somewhat  better  used  sign  q.  (X  ) in  place  of  q.  (X  ) in  (3.1a).  A 

in  in 

typical  run  is  plotted  in  Figure  3.2.  The  oscillations  of  {An}  were 


20. 


larger  - and  it  was  not  clear  whether  (An>  ever  converged.  But 

{ } did  appear  to  converge  - and  often  did  better  than  for  the 

original  scheme  (3.1).  Although  the  theoretical  properties  of  this 

modification  are  not  known,  it  is  suspected  that  some  such 

variation  of  the  An  iteration  formula  would  yield  more  stable 

{Xn>  paths  , and  nice  convergence. 

If  we  assumed  that  the  observation  was  f (X  ) + noise, 

x n 

instead  of  f(Xn)  + noise,  then  the  results  were  extremely  good; 
there  were  only  small  oscillations  of  the  (Xn>  path  as  it 
tended  to  0.  The  path  was  (roughly)  a smooth  version  of  those 
in  the  figures  until  iteration  20  or  so,  after  which  it  moved  to 
9 roughly  along  the  upper  boundary,  with  only  small  oscillations. 
Also,  (An>  settled  down  quite  fast.  See 

2 1 

Figure  3.8.,  a typical  run  for  the  value  a = j • 

Now  refer  to  Figures  3.5  to  3.8,  where  q] ( • ) is  not 
locally  convex  at  0,  and  the  multiplier  T1  at  0 is  0.464. 

Also,  9 = (-2 . 7 , - . 427 ) , f ( 0 ) = .255.  The  sequence  {X^} 
(experimentally)  converged,  despite  the  lack  of 

convexity.  In  Figures  3.5  and  3.6,  a2  = j . This  "non-convex" 

case  seemed  to  be  more  sensitive  to  noise  - but,  did  well  for 

2 1. 

~ J • Figure  3.7  illustrates  the  increased  oscillations  when 

2 

0=2  (although  the  plotted  run  was  a little  better  than  average) . 
In  Figures  3.5  to  3.7,  it  appears  that  the  An  are  settling 

near  zero.  Their  behavior  for  n < 25,000  is  similar  to  that  in 

2 

Figure  3.13.C,  which  is  for  the  data  of  Figure  3.6,  but  with  a = 


2. 
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Note  also  Figure  3.13a,b,  which  plot  (Xn>,  n 1 25,000,  for  cases 
with  the  f C * ) of  Figures  3.1  to  3.4. 

In  the  absence  of  a convergence  proof,  we  make  no  general 
claim  for  the  value  of  the  algorithm  for  non-convex  problems,  but 
it  is  interesting  to  observe  its  behavior  on  selected  examples. 

Figure  3.8  illustrates  the  fine  behavior  when  the  obser- 
vation is  f (X  ) + noise, 
x n 

A number  of  runs  were  taken  with  observation  noise  on  the 

constraints  also  tfor  which  case,  the  convergence  proof  remains 

valid).  In  (3.1),  replace  q (X  ) by  the  noise  corrupted 

xi  n 

estimate 


(3.3) 


q.  (X  +e.c  ) - q.  (X  -e.c  ) . - i b . 

n in  v n i n'  i,n  yi,n 


and  q^(XR)  by  the  average  of  the  noisy  observations 
^i ^xn  1 eicn^  + ^i,n 


C3.4) 


h £ Iqi(Xntei°n>  + «\±,n1 

1 f ~ 


where  the  tot  are  the  observation  noises  on  q.(X  ±e.c  ).  In 
Ti,n  l n in 

the  simulations,  the  {^t  } were  independent  of  the  {£  } 

1 1 n.  n 

(although  it  was  not  theoretically  necessary  to  do  so) . Also 
{^i  n^  was  se^-ecte<^  to  an  independent  sequence  of  zero  mean 
Gaussian  random  variables. 


Refer  to  Figure  3.9.  The  results  are  not  significantly 

different  than  for  the  case  where  iJk  = 0,  a rather  encouraging 

i f n 

sign  - since  there  are  numerous  practical  problems  with  constraints 

of  unknown  form,  but  where  noise  corrupted  observations  can  be 

taken.  The  data  are  particularly  interesting  since  (apart  from  some 

rather  inefficient  penalty  algorithms  [3],  [14]),  the 

Lagrangian  method  is  the  only  one  known  to  converge  when  there 

is  observation  noise  on  the  constraints.  If  qx (Xr ) + noise 

and  q(.Xn)  + noise  were  observed  in  lieu  of  (3.3),  (3.4),  then 

the  results  were  virtually  the  same  as  with  n = 0,  provided 

that  the  noise  corrupted  finite  difference  estimate  <SYn  was 

still  used  in  order  to  estimate  f (X  ) . If  f (X  ) + noise, 

x n x n 

qx(Xn)  + noise,  q(Xn)  + noise,  were  observed,  then  the  main 

effects  of  the  "constraint"  noise  seemed  to  be  similar  to  the 

effects  due  to  increasing  the  performance  measurement  noise 

somewhat,  and  the  iterations  still  converged  well.  Owing  to 

the  dependence  on  (X  } of  the  effects  of  the  noise  (i|>7  }, 

n l , n 

the  exact  effects  of  this  noise  are  more  difficult  to  understand, 
and  we  do  not  have  a convergence  rate  result  in  this  case. 

Figures  3.10  to  3.12  depict  results  for  a case  where  two 
constraints  are  active  at  0.  Again,  if  fv(X  ) + noise  is  observed 
(Figure  3.10),  the  path  behavior  is  excellent.  A comparison 
of  Figures  3.11  and  3.12  shows  the  effect  of  quadrupling  the 
noise  variance.  The  value  = 2 which  was  used  may  be 
slightly  too  small  for  constraint  q^*),  as  suggested  by  the 
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fact  that  U^}  stays  at  the  value  2 for  a rather  long  time. 

But,  since  convergence  (at  least  to  a point  very  near  the  con- 

• . . 2 
strained  minimum)  seems  to  occur,  and  since  An  eventually 

2 

starts  to  decrease,  perhaps  = 2 is  not  too  small.  Probably, 

a reasonable  adaptive  procedure  for  adjusting  the  would 

2 

have  increased  slightly. 

A comparison  of  tihe  Lagrangian  algorithm  with  those  of 
Sections  4 and  5 appears  in  Table  5.1. 

The  effect  of  A^.  Let  f(*),q(*)  be  as  in  Figure  3.1, 
and  let  Ax  = A calculation  of  the  eigenvalues  ( p ) of  the 

-K2  above  (3.2)  yields:  for  A^  = 1,  = .396,  p2  and 


P3  = .245  ± i .459;  for  A^  = 5,  p^  = .395,  p2  and 

p3  = .245  ± i 1.39.  For  the  situation  of  Figure  3.5,  for 


> 

>* 

II 

H* 

we  have 

P^  — • 322 , 

p2 

and 

p3  = .127  ± 

i 

.643  and  for 

aa  5' 

we  have 

Px  = .308, 

P2 

and 

p3  = .134  ± 

i 

1.493.  For 

fixed  A , the  value  of  A..  had  negligible  effects  on  the  real 
X A 

parts  of  the  eigenvalues,  but  as  A^  increased,  the  imaginary 
parts  increased;  hence,  the  "frequency"  of  oscillation  of 
{ An> , {Xn}  increased,  for  large  n.  This  "frequency",  result 
holds  because,  under  a suitable  time  scaling,  (Xn,Xn)  have  the 
same  "oscillatory"  properties  as  do  (3.2);  see  [10].  This 
remark  is  consistent  with  our  observations,  as  discussed  above. 
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4.  An  Algorithm  for  Inequality  Constraints 
(Algorithm  3 of  [81 


The  augumented  multiplier  type  algorithms  for  inequality 
constraints  (here  and  in  Section  5)  are  more  complicated  than 
the  previous  algorithms.  On  the  whole,  they  worked  surprisingly 
well,  even  when  the  conditions  for  convergence  were  violated. 

The  algorithm  here  is  an  adaptation  of  that  of  Section  2.  All 
undefined  terms  retain  their  definitions  from  the  previous  sections. 

~ r ~2 

Define  qj[(x)  = max  [ 0 , q±  (x)  ] , P (x)  = l q±(x),  and  let 
$ (x)  denote  the  Jacobian  of  the  vector  (q^(x),  all  i such  that 
q^  (x)  > 0}.  Let  $n  = $>(Xn).  The  constraints  q^  ( • ) are  known, 
but,  again,  f(*)  is  known  only  via  the  noise  corrupted  observations . 
Define  (A  will  be  defined  below) 


(4.1} 


n+1 


= X 


n 


- a [ <5Y 
n n 


$ 'X_ 

n n 


irW1 


:i 


The  algorithm  of  Section  2 can  be  written  as  (4.1),  where 
replaces  q^  and  where  An  is  the  minimizing  A in 

1 


2 * * 
(4.2)  min|6Yn  + $^A | . 


onto  the  positive  cone 
Then  (4.1)  become  s 


{y:  y = -l  ^q.  v(x),  xl  > o,  q.  (x)  > o}. 

. ± f X.  “ 1 


(4.3) 


X , . = X - a [ir+  (X  ) 5Y  + j PCX)], 
n+l  n n n n l x n 


and  the  relationship  to  (2.1)  becomes  apparent.  The  analysis  here 
is  more  complicated,  since  tt+(x)  is  not  linear.  In  our 
simulations,  we  used  one-sided  differences,  and  replaced  6Yn  by 


5V 


It  is  proved  in  [8]  that,  under  various  conditions,  Xn  -*■  6 


where  0 is  a Kuhn-Tucker  point  for  the  constrained  optimization 
problem.  No  asymptotic  rate  results  are  currently  available. 

For  our  simulated  problems,  all  the  conditions  for  con- 
vergence were  satisfied,  except  for  (A6)  of  [8j.  (The  tt+(x)  and 
tt+(x)  of  (A6)  of  [8]  are  actually  the  same.)  The  finite  difference 
form  of  this  condition  is:  there  is  a real  k^  > 0 such  that 

+ V2cn>  1 x0 xn-l)  i kl£i< V"+<V  £x  (Xn > 

Cln  the  forward  difference  case  2c  is  replaced  by  c . ) In  our 

n n 

case,  the  condition  is  not  satisfied  near  the  Kuhn-Tucker  point. 

Yet,  numerical  convergence  seemed  to  have  always  occurred.  The 
point  will  be  developed  elsewhere,  but,  from  preliminary  work,  it 
appears  that  suitable  (non-obvious)  adaptations  of  the  weak  con- 
vergence methods  in  [9]  yield  a convergence  proof,  under  conditions 
which  hold  here.  We  note  only  the  following:  for  each  cn,  a 
vector  field  on  Rr  is  drawn,  where  the  direction  of  the  flow  line 
at  x is  the  average  direction  of  movement  of  Xn+1  - Xn,  given 
that  Xn  = x.  Then,  as  n -*■  °»,  the  flows  tend  to  a flow,  whose 
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lines  "flow"  to  6.  On  the  boundary,  there  is  a special  problem, 
since  the  average  directions  are  discontinuous,  and  in  the  limit, 
the  "average"  flow  is  of  a "chattering"  type.  So  far,  the 
technique  of  proof  of  actual  convergence  does  not  appear  to  be 
simple,  and  it  will  be  developed  elsewhere.  The  need  for  the 
generalization  was,  in  fact,  suggested  by  the  numerical  study. 

Experimental  data.  A few  selected  typical  runs  are  plotted 
in  Figures+  4.1  to  4.4.  The  method  appears  to  be  quite  robust. 

Apart  from  the  usual  SA  sequences  (an,cn>,  we  need  only  select  k. 

As  in  Section  2,  the  larger  values  of  k appear  to  give  better 
results  than  the  smaller  values.  For  k in  the  range  of  .5 
to  4,  the  differences  in  final  values  were  quite  small.  Several 
runs  were  taken  using  the  £(•)  of  the  Lagrangian  Figure  3.1. 

Unlike  the  Lagrangian  case,  this  procedure  often  got  caught  at  the 
local  min  0^,  when  the  initial  point  was  to  the  left  of  0^. 

Inside  the  constraint  set,  the  procedure  is  a classical  SA 

gradient  procedure.  When  outside,  the  penalty  terms  P (X  ) pull  the 
path  to  the  constraint  set.  There  is  also  another  type  of  "pull"  to 
this  set.  Refer  to  Figure  4.5,  where  we  consider  a case  where 
there  is  only  one  constraint.  The  vectors  v^,  i < 4,  represent  6Yn 
6Yn*  Recall  the  definition  of  tt(x)  (equality  constraint  case). 
Then  tt(x)  applied  to  v1  or  v3  is  v,.  and  to  v2  or  v4 
is  vg,  while  tt+(x)  applied  to  v3  or  v4  are  v5  and  vg, 

+Here  and  in  Section  5,  6 = (-2.7,-0.427),  f ( 9 ) = .255. 


or 


2' 


resp.,  and  tt+(x)  applied  to  w1  and  v2  are  v1  and  v2, 
resp.  For  this  reason,  the  -rr + (XR ) 6 Yn  (or  -tt  + CXn ) 6 Yn ) term 
in  C4.3)  tends  to  create  a "pull"  in  the  direction  of  the  con- 
straint surface.  The  effects  of  this  can  be  seen  in  Figure  4.3 
at  iterates  8 to  11,  and  in  Figure  4.4,  after  iterate  15. 

In  fact,  at  iterate  15  in  Figure  4.3,  fx(Xn)  % 0,  and  both  the 
biased  projections  of  the  random  observation  noise  and  the  penalty 
term  px^Xn)  pull  the  path  directly  to  the  constraint  set. 

Many  types  of  adaptive  modifications  are  possible.  For 
example,  the  value  of  an  could  depend  on  whether  or  not  some 
constraint  was  satisfied  or  not.  Observe  that  the  variations 
in  tend  to  be  greater  (when  the  sequence  is  inside  the 

constraint  set)  in  the  vertical  direction  - for  rather  obvious 
reasons.  Once  the  sequence  values  improve  enough  in  the  vertical 
direction  so  that  they  leave  the  constraint  set,  then  the  pro- 
jection and  penalty  effects  come  into  play,  and  the  path  then 
moves  toward  the  optimal  point  while  staying  close  to 
the  boundary.  In  the  Lagrangian  method,  it  is  only  the  "prices" 

X , which  keep  the  {X  } feasible  or  nearly  feasible, 
n n 

Table  4.1  gives  sample  averages  for  20  runs  of  500 

2 

iterations  each  on  a 5 dimensional  problem,  where  f (x)  = (x^-1)  + 

.5(x2-2)2  + .8(x3-1)2  + .4(x4+j)2  + .2(x5+|)2,  q1  = x2/4  + x2/2  + 

X2  + x2  + x2  - 1,  q2  = x2  - x2,  q3  = 4x3x4  - 1,  q4  = -x4 , q5  = -x,-. 
one-sided  differences  were  used  and  0 = (1,0,  j,  0,0),  q(0)  = 

(0 , 0, -1, 0, 0) , k = 4.  The  behavior  for  other  initial  points  was 
similar,  generally  improving  (becoming  worse)  as  Xq  was  moved 
closer  to  (further  from)  0. 
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A comparison  with  other  algorithms  appears  in  Table  5.1. 

The  runs  analogous  to  the  Lagrangian  runs  in  Figures  3.11  and  3.12 


also  behaved  well,  indeed  better  than  the  Lagrangian  runs  - 
convergence  was  faster.  Similar  runs  for  the  algorithm  of  the  next 


section  also  performed  better  than  the  Lagrangian  runs. 


5.  An  Alternative  Algorithm  for  the  Inequality  Case 


(Algorithm  4 of  [ 8 ] ) 


This  (more  complicated)  algorithm  is  the  result  of  a 

direct  attempt  to  handle  the  inequality  problem  by  converting 

it  into  an  equality  constrained  problem  via  the  addition  of 

slack  variables,  and  adapting  the  method  of  Section  2,  from 

which  some  of  the  notation  comes.  Let  {wn,vn>  denote  positive 

null  sequences.  Let  b denote  a positive  number.  Define 

<)>^(x,z)  = q^(x)  + bz1,  where  bz1  is  the  ifc^  slack  variable. 

Let  Zn  = (Z^,...,Z®)  denote  the  n*"*1  estimate  of  the  optimal 

2 

slack  variable  vector,  -q(0)/b,  and  define  P(x,z)  = £ 4>T(x,z). 

i 

Define  <J>  = $ (X  ) , where  $ (x)  = [<)>.  (x),...,<j>  (x)  J ' , 

II  II  if  X Sf  X 

as  earlier,  and  define  (Xn  will  be  defined  below) 


(5.1) 


Xn+1  = Xn 


a [6Y  + 4> ' X + k/2  P (X  , Z ) ] 

n n n n x n n 


If  Zn  > vn,  set 


mean  values  of  20  runs,  each  of  500  iterations 
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(5.2a) 


Cl  ■ ma*t°'zn  - 


If 


K < Vri  f 

n — n 


use  (5.2b,c  or  d)  according  to  the  case. 


(5.2b) 

(5. 2c) 
(5. 2d) 


Cl  * Zn  - VbCkVXn'V>'  i£  *i(xn'V  i 0 

and  bA1  + k<J> . (X  , Z ) < 0, 
n i n n — 

Cl  - Zn  i£  *i<VZn>  i °->C  + k+i<Xn'Zn>  > °- 


Cl  * ZC  Vi,  i£  *i(Xn'Zn>  " °- 


The  value  of  \n  is  selected  by  a type  of  projection  [ 8 ] 
(in  x,z  space),  similarly  to  what  was  done  in  Section  2 in 
x space).  The  method  is  equivalent  to  selecting  Ar  to  be  a 
minimizing  vector  (unconstrained  in  sign)  in 


(5.3) 


min 

A 


+ b 


I (A1) 


:Z1>v 
n n 


2 


(in  the  forward  difference  case,  which  was  simulated,  iSY^  replaces 

6Y  in  (5.1)  and  (5.3)). 
n 

The  algorithm  seems  more  involved  than  it  is,  and  works 

well.  Under  conditions  which  hold  here  (unless  otherwise 

mentioned)  Xn  converges  to  a Kuhn-Tucker  point  9.  The  value 

of  v determines  an  a priori  level  of  the  slack  variable  bZ1 
n n 

at  which  the  ith  constraint  is  presumed  to  be  active. 
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If  a.  (• ) is  inactive  at  X , then  Z , is  calculated  in  the  same 

n n+i 

way  that  xn+^  is  calculated. + If  < vn , the  rule  is  more 

complicated.  Note  that  (although,  for  technical  reasons  which 

can  probably  be  circumvented,  £ (an/vn) ^ < 00 ' some  & > 0,  was 

required  in  the  proof  in  [ 8 ] ) if  vn  E 0 in  our  simulations, 

the  results  were  as  good  as  or  better  than  those  for  the  other 

cases.  The  (wn>  sequence  and  the  X^  in  (5.2a,b)  serve  the 

purpose  (among  others)  of  assuring  that  {X^}  cannot  converge 

to  a point  W where  some  multiplier  X (in  the  Kuhn-Tucker 

condition)  would  be  negative.  If  Xn  is  near  such  a point  0, 

then  (loosely  speaking)  the  general  average  negative  value  of 

X^  (for  large  n)  would  force  X away  (if  d> . (X  ,Z  ) < 0)  and 
n 3 n J l n n - 

the  iteration  (5. 2d)  (together  with  the  penalty  effect,  forcing 

<b.(X  ,Z  ) to  be  small)  would  do  the  same  thing  if 
inn 

4> . (X  ,Z  ) > 0. 

Ti  n n 

The  calculation  (5.3)  can  also  be  viewed  as  a type  of 

projection  of  6Y^  onto  the  tangent  plane  to  the  surface 

(y:  <3-  (y)  -q.(X)=0,  i<s}  at  y = X,  but  where  the  weights 
l in  — n 

of  the  "inactive"  constraints  are  penalized  by  the  additional 
quadratic  term,  so  that  they  will  not  be  too  large.  If  all  X1 
are  "penalized",  then  Xn  will  be  "relatively"  small,  and  (5.1) 
behaves  more  like  a gradient  procedure. 


r . . • 

Note  that  X^b  is  just  ^n^i  ^Xn'  Zn^  ^^Zn'  and  that  ^Yn  + ^n^n 
is  an  estimate  of  gradx(f(x)  + l X^<()^(x,z)]  at  x = Xn,  z = 
The  two  expressions  are  similar;  since  f(*) 
z,  no  6Yn  need  appear  in  (5.2). 


does  not  depend  on 
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In  our  simulations,  we  let  k and  b be  variables.  Replace 

bx^  and  k<t>i^xn'zn)  bY  bi^Xn^n  and  ki  ^Xn^  ^Xn'  Zn^  ' where 
b^  (x)  = b1  (resp.  b2)  if  q^(Xn)  < 0 (resp.,  > 0)  , and  set 

k^  (x)  = k1  (resp.,  k2)  if  q^(Xn)  < 0 (resp.,  > 0).  Also,  we 
used  6Yn  and  set  bZ^  = max [0,-q^ (XQ) ] , and  A = ^ , a = § 


. 2 1 --4  „ -1/6 

6 = 3,  Y = g , wn  = n , vn  = Vn  , c = 


n 


2n 


1 2 _ _ 
I 76  ' a " 2* 


Numerical  results.  The  general  qualitative  features  are 

illustrated  in  Figures  5.1  to  5.6.  Note  the  behavior  from  iterate 

20  to  iterate  80  in  Figure  5.1,  and  a similar  behavior  in 

Figure  5.2.  The  looping  is  caused  by  the  bias  in  the  finite 

difference  estimate  6f(X  ,c  ) of  f (X  ) . Recall  that  our 

n n x n 

bias  is  quite  large.  Here,  it  is  just  big  enough  at  iterate  20 

1 . 

to  reverse  the  sign  of  E[Xn|  xn  ^ X20^  from  positive  to  negative. 
So  the  forms  (5.2a,  b)  increase  the  value  of  Z^  on  the  average, 
hence  the  penalty  term  in  (5.1)  forces  X^  inside  the  feasible 
set  (q^(Xn)  must  decrease).  Eventually,  the  bias  decreases. 

The  point  is  important,  since  the  behavior  of  an  algorithm  can 
depend  heavily  on  the  form  of  the  available  data,  as  well  as  on 
the  algorithm  itself.  With  less  bias,  all  the  algorithms  would 
have  behaved  better. 

Generally,  larger  wfi  does  better  than  smaller  wn.  As 
indicated  earlier,  (wn)  plays  an  important  role  in  pushing  the 
{ Xn } sequence  away  from  boundary  points  (toward  the  interior). 

If  the  average  X^  values  are  all  positive,  the  Xn  will  drift 
back  to  the  boundary  point.  If  the  average  X*  are  not  all  > 0, 
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the  w^  and  play  the  role  of  pushing  away.  (When 

> 0,  for  "penalty"  reasons,  we  do  not  want  to  increase 
Zn,  but  we  do  want  to  decrease  qi  (X  ) . ) As  the  elements  of 
{wn>  increase,  the  path  { } tends  to  be  more  on  the  inside  of 
the  boundary  (if  0 is  on  the  boundary)  than  on  the  outside,  as 
it  moves  to  0 . 

The  {X^}  has  a tendency  to  move  along  the  boundary  (see 
the  typical  Figure  5.6),  owing  to  the  method  of  adjusting  Zn, 
and  to  the  penalty  term.  This  is  especially  true  if  k is 
large  (compare  Figure  5.2  with  5.6).  With  k = 1,  the  procedure 
converged  much  more  slowly.  But,  if  k is  too  large,  then  it 
appears  that  the  greater  penalty  puts  too  much  emphasis  on  the 

convergence  to  zero  of  P (X  ,Z  ),  which  seems  to  slow  down  the 

n n 

actual  path  convergence.  Figures  5.3  and  5.4,  for  the  same  para- 
meters, illustrate  great  differences  due  to  different  noise.  If 
b1  = b2  = 1,  convergence  was  slow.  If  b.^  > b2,  then  if  a con- 
straint is  satisfied,  it  plays  a smaller  role  in  the  calculations  of 
An,  and,  interior  to  the  constraint  set,  the  procedure  behaves  more 
like  the  gradient  SA  procedure,  as  it  should. 

Now,  compare  Figures  5.1  and  5.4,  the  data  is  the  same 
except  for  the  noise  sequence  and  the  value  of  (k^,k2).  In  the 


run  of  Figure  5.4,  the  signs  of 

the 

4> . (x  ,z  ) , 

Ti  n n ' 

n = 2,  3 

were  such 

that  the  cf>.  (X  , Z ) • 4> . (X  , Zn) 
l , x n n l n n 

terms,  i = 2,3 

, in  P 

(X  ,Z  ) 

X 

n n 

partially  offset  the  effect  of 

the 

<f>,  (X  ,Z  ) 

1 , x n n 

terms , 

until 

n = 80.  Thus,  the  path  remained  "outside"  rather  long.  This  *:vpe 

of  phenomenon  was  a common  occurrence  if  k^  was  too  small  - 'or 

then  the  values  of  <f>  • (X  ,Z  ) (for  the  i such  that  q.  (X  ) 

inn  n 


0) 
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were  not  always  close  enough  to  zero  for  their  effect  on  Px(Xn,Zn) 
to  be  negligible. 

Two  paths  in  the  plotted  graphs  were  typical  - although 
the  final  values  were  slightly  better  than  average.  The  paths 
are  better  behaved  than  those  for  the  previous  two  algorithms. 

The  procedure  is  not  overly  sensitive  to  variations  in  b,k  and 
A,  within  a reasonable  range. 

Finally,  refer  to  Table  5.1,  where  several  algorithms 
are  compared,  via  sample  averages  and  variances  of  20  runs. 
Generally,  the  algorithm  of  this  section  performed  best,  and  the 
Lagrangian  worst.  The  data  for  initial  point  (3,-3)  are 
comparable  in  magnitudes  and  qualitative  properties,  except  that 
ftXsoo)  °f  1,2  of  the  last  algorithm  were  .386  and  .373, 
resp..  The  constraints  are  satisfied  reasonably  well.  For  the 
configuration  and  initial  point  of  Figure  3.1,  only  the 
Lagrangian  avoided  (and  consistently)  being  caught  by  the  local 

maximum  0^.  If  the  last  algorithm  of  this  section  is  to  be  used, 
we  would  recommend  vn  = 0.  The  table  illustrates  the  robustness 
of  the  various  algorithms,  under  parameter  variations  (within 
reasonable  ranges) . 


6 . Conclusions 

The  four  algorithms  all  work  well,  within  certain  ranges. 
The  equality  case  method,  and  those  of  Sections  4,  5 are  robust 
and  reliable. 
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crz=  2,  k = 1/2 , 8y„  used 

Fig.  2.1  Equolity  constraint 


arz~Z , K = 1/2 , syn  used 
Fig.  2.3  Equolity  constroint 
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i g.  3. 13  Variation  of 


O’  =1/2,  k = I,  8 y„  used 
Fig.  4.1  Positive  cone 

projection  olgorithm 


projection  algorithm 


cr  = 2,  k = 4 , Sy„  used 
Fig.  4.3  Positive  cone 

projection  algorithm 
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